Timo Raita
University of Turku
Publications
Featured research published by Timo Raita.
Software - Practice and Experience | 1992
Timo Raita
Substring search is a common activity in computing. The fastest known search method is that of Boyer and Moore with the improvements introduced by Horspool. This paper presents a new implementation which takes advantage of the dependencies between the characters. The resulting code runs 25 per cent faster than the best currently known routine.
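As a rough illustration of the idea, the sketch below combines Horspool's bad-character shift with a comparison order that tests the last and first pattern characters before the remainder. The function name and the simplified comparison order are assumptions made here for illustration; this is not the routine measured in the paper.

```python
def raita_style_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of `pattern` in `text`,
    or -1 if there is none.  Horspool's bad-character shift is combined
    with a comparison order that checks the last and first pattern
    characters before the rest, in the spirit of exploiting character
    dependencies."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Horspool shift table: how far to slide when the text character
    # aligned with the pattern's last position does not lead to a match.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    last, first = pattern[-1], pattern[0]
    pos = 0
    while pos <= n - m:
        # Cheap, highly discriminating tests first; full check only then.
        if text[pos + m - 1] == last and text[pos] == first \
                and text[pos:pos + m] == pattern:
            return pos
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

For example, `raita_style_search("hoohoola", "hoola")` locates the pattern at index 3 after only two shifts.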
Journal of the Association for Information Science and Technology | 1998
Abraham Bookstein; Shmuel T. Klein; Timo Raita
Information Retrieval Systems identify content-bearing words, and possibly also assign weights, as part of the process of formulating requests. For optimal retrieval efficiency, it is desirable that this be done automatically. This article defines the notion of serial clustering of words in text, and explores the value of such clustering as an indicator of a word's bearing of content. This approach is flexible in the sense that it is sensitive to context: a term may be assessed as content-bearing within one collection, but not another. Our approach, being numerical, may also be of value in assigning weights to terms in requests. Experimental support is obtained from natural text databases in three different languages.
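A minimal sketch of one way such serial clustering could be quantified is given below: it compares the variance of a term's per-block counts to their mean, so that clumped terms score well above 1. The block size, the variance-to-mean ratio, and the function name are illustrative assumptions, not the measure defined in the article.

```python
from statistics import mean, pvariance

def clumping_score(tokens, term, block_size=200):
    """Split the token stream into fixed-size blocks, count occurrences
    of `term` per block, and return the variance-to-mean ratio of those
    counts.  Under a memoryless model the ratio stays near 1; terms
    that cluster serially score noticeably higher."""
    counts = [tokens[i:i + block_size].count(term)
              for i in range(0, len(tokens), block_size)]
    if len(counts) < 2 or mean(counts) == 0:
        return 0.0
    return pvariance(counts) / mean(counts)
```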
IEEE Transactions on Signal Processing | 1991
Martti Juhola; Jyrki Katajainen; Timo Raita
Standard median filtering, in which a median is repeatedly searched from a sample set that changes only slightly between subsequent searches, is discussed. Several well-known methods for solving this running median problem are reviewed, the (asymptotic) time complexities of the methods are analyzed, and simple variants are proposed which are especially suited for small sample sets, a frequent situation. Although the discussion is restricted to the one-dimensional case, the ideas are easily extended to higher dimensions.
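The sketch below conveys the flavour of such simple variants: the current window is kept in sorted order and updated incrementally as it slides, which is attractive for small sample sets. The window size, border handling, and data structure are choices made here for illustration rather than any particular algorithm analysed in the paper.

```python
from bisect import bisect_left, insort

def running_median(signal, window=5):
    """Running median for a small, odd window: keep the current window
    in sorted order and update it incrementally as the window slides,
    instead of re-selecting the median from scratch at every position.
    Border samples are simply skipped here."""
    if len(signal) < window:
        return []
    half = window // 2
    sorted_win = sorted(signal[:window])
    medians = [sorted_win[half]]
    for i in range(window, len(signal)):
        # Delete the sample that leaves the window, insert the new one.
        sorted_win.pop(bisect_left(sorted_win, signal[i - window]))
        insort(sorted_win, signal[i])
        medians.append(sorted_win[half])
    return medians
```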
Information Retrieval | 2002
Abraham Bookstein; Vladimir A. Kulyukin; Timo Raita
Many problems in information retrieval and related fields depend on a reliable measure of the distance or similarity between objects that, most frequently, are represented as vectors. This paper considers vectors of bits. Such data structures implement entities as diverse as bitmaps that indicate the occurrences of terms and bitstrings indicating the presence of edges in images. For such applications, a popular distance measure is the Hamming distance. The value of the Hamming distance for information retrieval applications is limited by the fact that it counts only exact matches, whereas in information retrieval, corresponding bits that are close by can still be considered to be almost identical. We define a “Generalized Hamming distance” that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure. In this paper we define and prove some basic properties of the “Generalized Hamming distance”, and illustrate its use in the area of object recognition. We evaluate our implementation in a series of experiments, using autonomous robots to test the measure's effectiveness in relating similar bitstrings.
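The following dynamic-programming sketch shows how partial credit for near misses can be computed: it aligns the positions of the one-bits of the two strings, charging an assumed unit cost for an unmatched one-bit and a cost proportional to displacement for a shifted one. The cost constants and the function name are illustrative assumptions, not the definition given in the paper.

```python
def generalized_hamming(a: str, b: str, ins_cost=1.0, shift_cost=0.5):
    """DP over the positions of the one-bits of bitstrings `a` and `b`:
    matching two one-bits costs shift_cost times their displacement,
    and an unmatched one-bit costs ins_cost."""
    pa = [i for i, bit in enumerate(a) if bit == '1']
    pb = [i for i, bit in enumerate(b) if bit == '1']
    n, m = len(pa), len(pb)
    # dp[i][j]: cost of aligning the first i one-bits of a
    # with the first j one-bits of b.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * ins_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + ins_cost,          # leave a one-bit of a unmatched
                dp[i][j - 1] + ins_cost,          # leave a one-bit of b unmatched
                dp[i - 1][j - 1] + shift_cost * abs(pa[i - 1] - pb[j - 1]),
            )
    return dp[n][m]
```

With these illustrative costs, `generalized_hamming("10010", "01010")` evaluates to 0.5, whereas the plain Hamming distance of the two strings is 2.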
ACM Transactions on Information Systems | 1997
Abraham Bookstein; Shmuel T. Klein; Timo Raita
An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the one-bits represent the occurrence of given terms. Hidden Markov Models (HMMs) are used to describe the clustering of the one-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph-theoretic reduction and complementation operations are defined among the various models and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Française and the Responsa Project.
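To give a feel for the modelling step, the toy sketch below fits a plain first-order Markov chain to a bitmap and reports the ideal code length an arithmetic coder driven by that model would reach; clustered one-bits skew the transition probabilities and shorten the code. This two-state toy is an assumption for illustration only and is far simpler than the constrained n-state models inventoried in the paper.

```python
from math import log2

def markov_code_length(bits: str) -> float:
    """Fit a first-order, two-state Markov chain to a concordance
    bitmap (with add-one smoothing) and return the ideal code length,
    in bits, that an arithmetic coder driven by this model would
    achieve."""
    trans = {'0': {'0': 1, '1': 1}, '1': {'0': 1, '1': 1}}
    for prev, cur in zip(bits, bits[1:]):
        trans[prev][cur] += 1
    length = 1.0                       # charge one bit for the first symbol
    for prev, cur in zip(bits, bits[1:]):
        total = trans[prev]['0'] + trans[prev]['1']
        length += -log2(trans[prev][cur] / total)
    return length
```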
Journal of the ACM | 1992
Jyrki Katajainen; Timo Raita
Text compression is often done using a fixed, previously formed dictionary (code book) that expresses which substrings of the text can be replaced by code words. An optimal solution to the text-encoding problem always exists, but due to the long processing times of the various optimal algorithms, several heuristics have been proposed in the literature. In this paper, the worst-case compression gains obtained by the longest-match and the greedy heuristics for various types of dictionaries are studied. For general dictionaries, the performance of the heuristics can be almost the weakest possible. In practice, however, the dictionaries usually have properties that lead to a space-optimal or near-space-optimal coding result with the heuristics.
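A sketch of the longest-match heuristic over a static dictionary is shown below: at each position it emits the longest dictionary entry that matches, falling back to a raw character. Code-word lengths, the greedy variant, and the optimal parse are left out, so this illustrates the parsing step only.

```python
def longest_match_parse(text: str, dictionary: set) -> list:
    """Parse `text` with the longest-match heuristic: at each position
    emit the longest dictionary entry that starts there, falling back
    to a single raw character when nothing matches.  The number of
    emitted code words is the quantity the heuristic tries to keep
    small."""
    max_len = max(len(w) for w in dictionary)
    parse, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                parse.append(text[i:i + length])
                i += length
                break
        else:                          # no entry matched: emit a literal
            parse.append(text[i])
            i += 1
    return parse
```

With the toy dictionary {"ab", "aba", "b", "c"}, the text "abac" parses into the two code words "aba" and "c".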
International ACM SIGIR Conference on Research and Development in Information Retrieval | 1987
Timo Raita; Jukka Teuhola
The knowledge of a short substring constitutes a good basis for guessing the next character in a natural language text. This observation, i.e. the repeated guessing and encoding of subsequent characters, is fundamental to predictive text compression. The paper describes a family of such compression methods, using a hash table for searching the prediction information. The experiments show that the methods produce good compression gains and, moreover, are very fast. The one-pass versions are especially apt for “on-the-fly” compression of transmitted data, and could serve as a basis for specialized hardware.
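The sketch below illustrates the prediction step under assumed parameters: a hash table keyed by the preceding few characters stores the most recently seen successor, and the hit rate indicates how much an encoder built on top of it could gain. The context length, the single-successor table, and the omission of the actual encoding are simplifications made here, not the exact design of the methods in the paper.

```python
def prediction_hit_rate(text: str, order: int = 3) -> float:
    """One-pass, adaptive prediction: a hash table maps the preceding
    `order` characters to the successor seen most recently in that
    context, and the function reports how often the guess is correct.
    A real coder would map hits to short code words and misses to an
    escape plus a literal; the encoding step is omitted here."""
    table = {}
    hits = 0
    for i in range(order, len(text)):
        context = text[i - order:i]
        if table.get(context) == text[i]:
            hits += 1
        table[context] = text[i]       # update the prediction on the fly
    return hits / max(1, len(text) - order)
```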
Journal of the Association for Information Science and Technology | 2003
Abraham Bookstein; Vladimir A. Kulyukin; Timo Raita; John Nicholson
Automated information retrieval relies heavily on statistical regularities that emerge as terms are deposited to produce text. This paper examines statistical patterns expected of a pair of terms that are semantically related to each other. Guided by a conceptualization of the text generation process, we derive measures of how tightly two terms are semantically associated. Our main objective is to probe whether such measures yield reasonable results. Specifically, we examine how the tendency of a content-bearing term to clump, as quantified by previously developed measures of term clumping, is influenced by the presence of other terms. This approach allows us to present a toolkit from which a range of measures can be constructed. As an illustration, one of several suggested measures is evaluated on a large text corpus built from an on-line encyclopedia.
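As a loose illustration of conditioning a clumping measure on a second term, the sketch below computes the variance-to-mean ratio of one term's per-block counts separately for blocks that do and do not contain the other term. The block size, the ratio, and the whole construction are assumptions made here in the spirit of the toolkit; they are not among the measures derived in the article.

```python
from statistics import mean, pvariance

def conditional_clumping(tokens, term, other, block_size=200):
    """Variance-to-mean ratio of `term`'s per-block counts, computed
    separately for blocks that contain `other` and blocks that do not.
    A markedly higher ratio in the presence of `other` hints that the
    two terms are associated."""
    with_other, without_other = [], []
    for i in range(0, len(tokens), block_size):
        block = tokens[i:i + block_size]
        (with_other if other in block else without_other).append(block.count(term))

    def ratio(counts):
        if len(counts) < 2 or mean(counts) == 0:
            return 0.0
        return pvariance(counts) / mean(counts)

    return ratio(with_other), ratio(without_other)
```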
Data Compression Conference | 1992
Abraham Bookstein; Shmuel T. Klein; Timo Raita
The authors discuss concordance compression using the framework now customary in compression theory. They begin by creating a mathematical model of concordance generation, and then use optimal compression engines, such as Huffman or arithmetic coding, to do the actual compression. It should be noted that in the context of a static information retrieval system, compression and decompression are not symmetrical tasks. Compression is done only once, while building the system, whereas decompression is needed during the processing of every query and directly affects the response time. One may thus use extensive and costly preprocessing for compression, provided reasonably fast decompression methods are possible. Moreover, compression is applied to the full files (text, concordance, etc.), but decompression is needed only for (possibly many) short pieces, which may be accessed at random by means of pointers to their exact locations. Therefore the use of adaptive methods based on tables that systematically change from the beginning to the end of the file is ruled out. However, their concern is less the speed of encoding or decoding than relating concordance compression conceptually to the modern approach of data compression, and testing the effectiveness of their models.
Data Compression Conference | 1994
Abraham Bookstein; Shmuel T. Klein; Timo Raita
An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. In this paper, the earlier models are extended to take the possibility of clustering into account. The authors suggest several models adapted to concordances of large full-text information retrieval systems, which are generally subject to clustering.