Helena Ahonen | Researchain

Publication

Featured research published by Helena Ahonen.

Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98- | 1998

Applying data mining techniques for descriptive phrase extraction in digital document collections

Helena Ahonen; O. Heinonen; M. Klemettinen; A.I. Verkamo

Traditionally, texts have been analysed using various information retrieval-related methods, such as full-text analysis and natural language processing. However, only a few examples of data mining in text, particularly in full text, are available. In this paper, we show that general data mining methods are applicable to text analysis tasks such as descriptive phrase extraction. Moreover, we present a general framework for text mining. The framework follows the general knowledge discovery process and thus contains steps from preprocessing to utilization of the results. The data mining method that we apply is based on generalized episodes and episode rules. We give concrete examples of how to preprocess texts based on the intended use of the discovered results, and we introduce a weighting scheme that helps in pruning out redundant or non-descriptive phrases. We also present results from real-life data experiments.

International Colloquium on Grammatical Inference | 1994

Forming Grammars for Structured Documents: an Application of Grammatical Inference

Helena Ahonen; Heikki Mannila; Erja Nikunen

We consider the problem of generating grammars for classes of structured documents (dictionaries, encyclopedias, user manuals, and so on) from examples. The examples consist of structures of individual documents, and they can be collected either by converting typographical tagging of documents prepared for printing into structural tags, or by using document recognition techniques. Our method first forms finite-state automata that describe the examples completely. These automata are modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are used to construct the grammar. In addition to automata, an alternative representation, characteristic k-grams, is introduced. Some interactive operations that are necessary for generating a grammar for a large and complicated document are also described.
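
The first step named in the abstract, automata that describe the examples completely, can be illustrated with a prefix-tree automaton. This is a minimal sketch, not the authors' implementation: the function names and the dictionary-entry element names are hypothetical, and the later generalization and regular-expression steps are not shown.

```python
def build_prefix_tree(examples):
    """Build a deterministic automaton that accepts exactly the example sequences.

    State 0 is the start state. Transitions map (state, symbol) to a state.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for sequence in examples:
        state = 0
        for symbol in sequence:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(automaton, sequence):
    """Return True if the automaton accepts the given sequence of element names."""
    transitions, accepting = automaton
    state = 0
    for symbol in sequence:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


# Hypothetical structures of three dictionary entries (element names are assumptions).
entries = [
    ["headword", "sense"],
    ["headword", "pronunciation", "sense"],
    ["headword", "sense", "sense"],
]
automaton = build_prefix_tree(entries)
```

Such an automaton rejects an entry with three senses because no example had one, which is the kind of gap the generalization step described in the abstract is meant to close.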

European Conference on Principles of Data Mining and Knowledge Discovery | 1997

Mining in the Phrasal Frontier

Helena Ahonen; Oskari Heinonen; Mika Klemettinen; A. Inkeri Verkamo

Data mining methods have been applied to a wide variety of domains. Surprisingly, only a few examples of data mining in text are available. However, considering the number of existing document collections, text mining would be most useful. Traditionally, texts have been analysed using various information retrieval-related methods and natural language processing. In this paper, we present our first experiments in applying general methods of data mining to discovering phrases and co-occurring terms. We also describe the text mining process developed. Our results show that data mining methods, with appropriate preprocessing, can be used in text processing, and that by shifting the focus the process can be used to obtain results for various purposes.
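
To make "co-occurring terms" concrete, the sketch below counts term pairs that appear together in at least a minimum number of documents. It is a simplified stand-in based on plain frequent-pair counting, not the method used in the paper; the function name, the whitespace tokenization, and the sample documents are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_cooccurrences(documents, min_support=2, stopwords=frozenset()):
    """Return term pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for text in documents:
        # Preprocessing: lowercase, split on whitespace, drop stopwords and repeats.
        terms = {word for word in text.lower().split() if word not in stopwords}
        # Sorting makes each pair canonical, so (a, b) and (b, a) are counted together.
        counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


sample_documents = [
    "data mining text",
    "text mining methods",
    "data structures",
]
pairs = frequent_cooccurrences(sample_documents, min_support=2)
```

Changing the preprocessing step (for example, the stopword list) changes which pairs survive, which mirrors the abstract's point that preprocessing determines what the results can be used for.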

PODP '96 Proceedings of the Third International Workshop on Principles of Document Processing | 1996

Disambiguation of SGML Content Models

Helena Ahonen

A Standard Generalized Markup Language (SGML) document has a document type definition (DTD) that specifies the allowed structures for the document. The basic components of a DTD are element declarations that contain for each element a content model, i.e., a regular expression that defines the allowed content for this element. The SGML standard requires that the content models of element declarations are unambiguous in the following sense: a content model is ambiguous if an element or character string occurring in the document instance can satisfy more than one primitive token in the content model without look-ahead. Brüggemann-Klein and Wood have studied the unambiguity of content models, and they have presented an algorithm that decides whether a content model is unambiguous. In this paper we present a disambiguation algorithm that, based on the work of Brüggemann-Klein and Wood, transforms an ambiguous content model into an unambiguous one by generalizing the language. We also present some experimental results obtained by our implementation of the algorithm in connection with an automatic DTD generation tool.
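
The ambiguity condition quoted above can be checked with the standard position-based (Glushkov-style) construction: mark each token occurrence as a position, compute which positions can come first and which can follow each position, and report ambiguity when two competing positions carry the same element name. The sketch below shows only this check, not the disambiguation algorithm of the paper. The tuple encoding of content models and the function names are assumptions, and the SGML `&` connector is not handled.

```python
from itertools import count


def analyze(model):
    """Compute first positions, follow sets, and position names for a content model.

    A model is a nested tuple: ("sym", name), ("seq", a, b), ("alt", a, b),
    ("star", a), or ("opt", a).
    """
    symbol_of = {}
    follow = {}
    counter = count()

    def visit(node):
        kind = node[0]
        if kind == "sym":
            pos = next(counter)
            symbol_of[pos] = node[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind in ("star", "opt"):
            _, first, last = visit(node[1])
            if kind == "star":
                # After the last token of one iteration, a new iteration may start.
                for pos in last:
                    follow[pos] |= first
            return True, first, last
        null_a, first_a, last_a = visit(node[1])
        null_b, first_b, last_b = visit(node[2])
        if kind == "seq":
            for pos in last_a:
                follow[pos] |= first_b
            first = first_a | first_b if null_a else first_a
            last = last_a | last_b if null_b else last_b
            return null_a and null_b, first, last
        if kind == "alt":
            return null_a or null_b, first_a | first_b, last_a | last_b
        raise ValueError(f"unknown node kind: {kind!r}")

    _, first, _ = visit(model)
    return first, follow, symbol_of


def is_unambiguous(model):
    """Return True if no two competing positions carry the same element name."""
    first, follow, symbol_of = analyze(model)
    for competing in [first, *follow.values()]:
        names = [symbol_of[pos] for pos in competing]
        if len(names) != len(set(names)):
            return False
    return True


# (a, b?), b : after a, an incoming b could match either occurrence of b.
ambiguous_model = ("seq", ("seq", ("sym", "a"), ("opt", ("sym", "b"))), ("sym", "b"))
# a, b* : every incoming element matches exactly one token.
unambiguous_model = ("seq", ("sym", "a"), ("star", ("sym", "b")))
```

With this encoding, `is_unambiguous(ambiguous_model)` is false because both occurrences of `b` can follow `a`, while `is_unambiguous(unambiguous_model)` is true.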

Mathematical and Computer Modelling | 1997

Generating grammars for SGML tagged texts lacking DTD

Helena Ahonen; Heikki Mannila; Erja Nikunen

We describe a technique for forming a context-free grammar for a document that has some kind of tagging (structural or typographical) but for which no concise description of the structure is available. The technique is based on ideas from machine learning. It first forms a set of finite-state automata describing the document completely. These automata are modified by considering certain context conditions; the modifications correspond to generalizing the underlying languages. Finally, the automata are converted into regular expressions, which are then used to construct the grammar. An alternative representation, characteristic k-grams, is also introduced. Additionally, the paper describes some interactive operations necessary for generating a grammar for a large and complicated document.

International Conference on Electronic Publishing | 1998

Design and Implementation of a Document Assembly Workbench

Helena Ahonen; Barbara Heikkinen; Oskari Heinonen; Jani Jaakkola; Pekka Kilpeläinen; Greger Lindén

Computers support the management of large collections of text documents, but efficient reuse of document collections for producing new documents remains inherently difficult. We describe and discuss the design and implementation of a document assembly system based on a document assembly model, where the user produces new specialized documents by querying and browsing a collection of structured document fragments.

Database and Expert Systems Applications | 1997

Assembling Documents from Digital Libraries

Helena Ahonen; Barbara Heikkinen; Oskari Heinonen; Pekka Kilpeläinen

We consider assembling documents using, as a source, a digital library containing SGML documents. The assembly process contains two parts: 1) finding interesting fragments, and 2) constructing a coherent document. We present a general document assembly framework. First, we describe a system for tailoring control engineering textbooks. Its assembling facilities are rather restricted but, on the other hand, the quality of documents produced is high. Second, we address the problem of filtering and combining interesting information from a large heterogeneous document collection. The methods presented offer various ways to find the interesting document fragments. Moreover, the elements found in the fragments are mapped to generic elements, like sections, paragraph containers, paragraphs and strings, which have known semantics. Hence, even arbitrary compositions can be formatted and printed.

International Conference on Electronic Publishing | 1998

Analysis of Document Structures for Element Type Classification

Helena Ahonen; Barbara Heikkinen; Oskari Heinonen; Jani Jaakkola; Mika Klemettinen

As more and more digital documents become available for public use from different sources, the needs of users also increase. Seamless integration of heterogeneous collections, e.g., the possibility to query and format documents in a uniform way, is one of these needs. Processing of documents is greatly enhanced if the structure of documents is explicitly represented by some standard (SGML, XML, HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration. We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions, and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a pre-defined generic DTD. The specific uses of the classification, in addition to formatting and querying, include assembly of new documents from existing document fragments and automatic generation of style sheet templates for original document type definitions. We have implemented the classification method and experimented with several document types.

Archive | 1994