
Publication


Featured research published by Anselm Blumer.


Journal of the ACM | 1989

Learnability and the Vapnik-Chervonenkis dimension

Anselm Blumer; Andrzej Ehrenfeucht; David Haussler; Manfred K. Warmuth

Valiant's learnability model is extended to learning classes of concepts defined by regions in Euclidean space E^n. The methods in this paper lead to a unified treatment of some of Valiant's results, along with previous results on distribution-free convergence of certain pattern recognition algorithms. It is shown that the essential condition for distribution-free learnability is finiteness of the Vapnik-Chervonenkis dimension, a simple combinatorial parameter of the class of concepts to be learned. Using this parameter, the complexity and closure properties of learnable classes are analyzed, and necessary and sufficient conditions are provided for feasible learnability.
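To make the combinatorial parameter concrete, here is a small brute-force sketch (hypothetical helpers `shatters` and `vc_dimension`, not from the paper) computing the VC dimension of the class of intervals over a discrete line:

```python
from itertools import combinations

def shatters(points, concepts):
    """True if the concept class (a list of sets) realizes all 2^n
    labelings of the given points, i.e. shatters them."""
    labelings = {tuple(p in c for p in points) for c in concepts}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest size of a subset of `domain` shattered by `concepts`
    (exponential brute force, for illustration only)."""
    d = 0
    for n in range(1, len(domain) + 1):
        if any(shatters(s, concepts) for s in combinations(domain, n)):
            d = n
    return d

# Concept class: closed intervals [a, b] over a small discrete domain.
domain = [0, 1, 2, 3, 4]
intervals = [set(range(a, b + 1)) for a in domain for b in domain if a <= b]

print(vc_dimension(domain, intervals))  # prints 2
```

Intervals shatter any two points but never three: the labeling in/out/in is impossible for a convex set, so the VC dimension is 2.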


Theoretical Computer Science | 1985

The smallest automaton recognizing the subwords of a text

Anselm Blumer; J. Blumer; David Haussler; Andrzej Ehrenfeucht; M. T. Chen; Joel I. Seiferas

Let a partial deterministic finite automaton be a DFA in which each state need not have a transition edge for each letter of the alphabet. We demonstrate that the smallest partial DFA for the set of all subwords of a given word w, |w| > 2, has at most 2|w| − 2 states and 3|w| − 4 transition edges, independently of the alphabet size. We give an algorithm to build this smallest partial DFA from the input w on-line in linear time.
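The on-line construction described here is, in modern terminology, the suffix automaton / DAWG construction. A minimal Python sketch of the standard online algorithm (not the paper's exact pseudocode) that builds the partial DFA and stays within the stated size bounds:

```python
def build_subword_dfa(w):
    # Standard online suffix-automaton (DAWG) construction; each state is
    # a dict holding its longest-string length, suffix link, and transitions.
    sa = [{'len': 0, 'link': -1, 'next': {}}]
    last = 0
    for ch in w:
        cur = len(sa)
        sa.append({'len': sa[last]['len'] + 1, 'link': -1, 'next': {}})
        p = last
        while p != -1 and ch not in sa[p]['next']:
            sa[p]['next'][ch] = cur
            p = sa[p]['link']
        if p == -1:
            sa[cur]['link'] = 0
        else:
            q = sa[p]['next'][ch]
            if sa[p]['len'] + 1 == sa[q]['len']:
                sa[cur]['link'] = q
            else:
                # Split: clone q so the automaton stays minimal.
                clone = len(sa)
                sa.append({'len': sa[p]['len'] + 1,
                           'link': sa[q]['link'],
                           'next': dict(sa[q]['next'])})
                while p != -1 and sa[p]['next'].get(ch) == q:
                    sa[p]['next'][ch] = clone
                    p = sa[p]['link']
                sa[q]['link'] = clone
                sa[cur]['link'] = clone
        last = cur
    return sa

def is_subword(sa, u):
    # Every state is accepting: trace u from the initial state.
    s = 0
    for ch in u:
        if ch not in sa[s]['next']:
            return False
        s = sa[s]['next'][ch]
    return True

w = "abcbc"
sa = build_subword_dfa(w)
states = len(sa)                              # 8 = 2|w| - 2
edges = sum(len(s['next']) for s in sa)       # 9 <= 3|w| - 4
```

For w = "abcbc" the construction produces exactly 2|w| − 2 = 8 states, meeting the bound.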


Journal of the ACM | 1987

Complete inverted files for efficient text retrieval and analysis

Anselm Blumer; J. Blumer; David Haussler; Ross M. McConnell; Andrzej Ehrenfeucht

Given a finite set of texts S = {w1, …, wk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
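The abstract data type can be illustrated with a deliberately naive sketch (hypothetical class name, quadratic space; the paper's contribution is achieving the same queries in linear space and build time):

```python
class NaiveInvertedFile:
    """Illustrative complete inverted file over a set of texts.
    Indexes every subword explicitly, so space is quadratic."""

    def __init__(self, texts):
        self.index = {}  # subword -> list of (text_id, position)
        for tid, t in enumerate(texts):
            for i in range(len(t)):
                for j in range(i + 1, len(t) + 1):
                    self.index.setdefault(t[i:j], []).append((tid, i))

    def find(self, w):
        """Longest prefix of w occurring as a subword of some text."""
        for k in range(len(w), 0, -1):
            if w[:k] in self.index:
                return w[:k]
        return ""

    def freq(self, w):
        """Number of occurrences of w across all texts."""
        return len(self.index.get(w, []))

    def locations(self, w):
        """Set of (text_id, position) pairs where w occurs."""
        return set(self.index.get(w, []))

inv = NaiveInvertedFile(["abab", "bc"])
print(inv.find("abc"))   # prints "ab"
print(inv.freq("ab"))    # prints 2
```

The interface matches the three query functions named in the abstract; only the implementation strategy differs.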


Discrete Applied Mathematics | 1989

Average sizes of suffix trees and DAWGs

Anselm Blumer; Andrzej Ehrenfeucht; David Haussler

Suffix trees, directed acyclic word graphs (DAWGs) and related data structures are useful for text retrieval and analysis. Linear upper and lower bounds on their sizes are known. Constructing these data structures for random strings, one observes that the size does not increase smoothly, but oscillates between these bounds. We use Mellin transforms to obtain size estimates as integrals of meromorphic functions. Poles on the real axis lead to exact formulae for the average sizes, while poles with nonzero imaginary part lead to very good estimates of the oscillations.


IEEE Transactions on Information Theory | 1988

The Rényi redundancy of generalized Huffman codes

Anselm Blumer; Robert J. McEliece

Huffman's algorithm gives optimal codes, as measured by average codeword length, and the redundancy can be measured as the difference between the average codeword length and Shannon's entropy. If the objective function is replaced by an exponentially weighted average, then a simple modification of Huffman's algorithm gives optimal codes. The redundancy can now be measured as the difference between this new average and A. Rényi's (1961) generalization of Shannon's entropy. By decreasing some of the codeword lengths in a Shannon code, the upper bound on the redundancy given in the standard proof of the noiseless source coding theorem is improved. The lower bound is improved by randomizing between codeword lengths, allowing linear programming techniques to be used on an integer programming problem. These bounds are shown to be asymptotically equal. The results are generalized to the Rényi case and are related to R.G. Gallager's (1978) bound on the redundancy of Huffman codes.
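For the unweighted baseline this abstract starts from, here is a sketch of Huffman's algorithm and the redundancy it leaves over the Shannon entropy (the standard algorithm only, not the paper's exponentially weighted variant, which changes the combining rule in the merge step):

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword lengths from Huffman's algorithm: repeatedly merge the
    two least-probable nodes; every merge adds one bit to each symbol
    under the merged node."""
    if len(probs) == 1:
        return [1]
    lengths = [0] * len(probs)
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, ids1 = heapq.heappop(heap)
        p2, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, ids1 + ids2))
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)                      # [1, 2, 3, 3]
avg = sum(p * l for p, l in zip(probs, lengths))      # average codeword length
entropy = -sum(p * log2(p) for p in probs)            # Shannon entropy (bits)
redundancy = avg - entropy                            # ~0.054 bits here
```

The noiseless source coding theorem guarantees this redundancy lies in [0, 1); the paper tightens such bounds and carries them over to the Rényi setting.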


Symposium on the Theory of Computing | 1984

Building a complete inverted file for a set of text files in linear time

Anselm Blumer; J. Blumer; Andrzej Ehrenfeucht; David Haussler; Ross M. McConnell

Given a finite set of texts S = {ω1, ..., ωk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(ω), which returns the longest prefix of ω which occurs in S; freq(ω), which returns the number of times ω occurs in S; and locations(ω), which returns the set of positions at which ω occurs. We give a data structure to implement a complete inverted file for S which occupies linear space and can be built in linear time, using the uniform cost RAM model. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, we use techniques from the theory of finite automata to build a deterministic finite automaton which recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions.


International Colloquium on Automata, Languages and Programming | 1984

Building the Minimal DFA for the Set of all Subwords of a Word On-line in Linear Time

Anselm Blumer; J. Blumer; Andrzej Ehrenfeucht; David Haussler; Ross M. McConnell

Let a partial deterministic finite automaton be a DFA in which each state need not have a transition edge for each letter of the alphabet. We demonstrate that the minimal partial DFA for the set of all subwords of a given word w, |w| > 2, has at most 2|w| − 2 states and 3|w| − 4 transition edges, independently of the alphabet size. We give an algorithm to build this minimal partial DFA from the input w on-line in linear time.


IEEE Transactions on Information Theory | 1987

Minimax universal noiseless coding for unifilar and Markov sources (Corresp.)

Anselm Blumer

Constructive upper bounds are presented for minimax universal noiseless coding of unifilar sources without any ergodicity assumptions. These bounds are obtained by quantizing the estimated probability distribution of source letters with respect to the relative entropy. They apply both to fixed-length to variable-length (FV) and variable-length to fixed-length (VF) codes. Unifilar sources are a generalization of the usual definition of Markov sources, so these results apply to Markov sources as well. These upper bounds agree asymptotically with the lower bounds given by Davisson for FV coding of stationary ergodic Markov sources.


Discrete Applied Mathematics | 1989

Learning faster than promised by the Vapnik-Chervonenkis dimension

Anselm Blumer; Nick Littlestone

We investigate the sample size needed to infer a separating line between two convex planar regions using Valiant's model of the complexity of learning from random examples [4]. A theorem proved in [1] using the Vapnik-Chervonenkis dimension gives an O((1/ε)ln(1/ε)) upper bound on the sample size sufficient to infer a separating line with error less than ε between two convex planar regions. This theorem requires that with high probability any separating line consistent with such a sample have small error. The present paper gives a lower bound showing that under this requirement the sample size cannot be improved. It is further shown that if this requirement is weakened to require only that a particular line which is tangent to the convex hulls of the sample points in the two regions have small error, then the ln(1/ε) term can be eliminated from the upper bound.


Sequence | 1990

Applications of DAWGs to data compression

Anselm Blumer

A string compression technique can compress well only if it has an accurate model of the data source. For a source with statistically independent characters, Huffman or arithmetic codes give optimal compression [11]. In this case it is straightforward to use a fixed source model if the statistics are known in advance, or to adapt the model to unknown or changing statistics. For the many sources which produce dependent characters, a more sophisticated source model can provide much better compression at the expense of the extra space and time for storing and maintaining the model. The space required by a straightforward implementation of a Markov model grows exponentially in the order of the model. The Directed Acyclic Word Graph (DAWG) can be built in linear time and space, and provides the information needed to obtain compression equal to that obtained using a Markov model of high order. This paper presents two algorithms for string compression using DAWGs. The first is a very simple idea which generalizes run-length coding. It obtains good compression in many cases, but is provably non-optimal. The second combines the main idea of the first with arithmetic coding, resulting in a great improvement in performance.
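The exponential space growth that motivates the DAWG can be checked with a little arithmetic (hypothetical helper names, for illustration only):

```python
def markov_table_size(alphabet_size, k):
    """Counters needed by a straightforward order-k Markov model:
    one next-character distribution per possible length-k context."""
    return alphabet_size ** (k + 1)

def observed_contexts(text, k):
    """Distinct length-k contexts actually present in a text; a
    linear-size DAWG-style structure only pays for these."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})

# Table size is exponential in the order k, regardless of the data:
print([markov_table_size(4, k) for k in range(1, 6)])
# prints [16, 64, 256, 1024, 4096]

# A highly repetitive text touches only a handful of contexts:
print(observed_contexts("abababab", 3))  # prints 2 ("aba" and "bab")
```

This is the gap the paper exploits: the DAWG stores only the structure present in the string, yet supports the statistics a high-order Markov model would provide.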

Collaboration


Dive into Anselm Blumer's collaborations.

Top Co-Authors


Andrzej Ehrenfeucht

University of Colorado Boulder


David Haussler

University of California


Emily Mower

University of Southern California
