Publication


Featured research published by Rainer Hoch.


International Conference on Document Analysis and Recognition | 1999

On the evaluation of document analysis components by recall, precision, and accuracy

Markus Junker; Rainer Hoch; Andreas Dengel

In document analysis, it is common to prove the usefulness of a component by an experimental evaluation. By applying the respective algorithms to a test sample, effectiveness measures such as recall, precision, and accuracy are computed. The goal of such an evaluation is two-fold: on the one hand, it shows that the absolute effectiveness of the algorithm is acceptable for practical use. On the other hand, the evaluation can prove that the algorithm is more or less effective than another algorithm. We argue that experimental evaluation on relatively small test sets, as is very common in document analysis, has to be taken with extreme care from a statistical point of view. In fact, it is surprising how weak the statements derived from such evaluations are.
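To illustrate why small test sets yield weak statements, the following minimal sketch computes recall, precision, and accuracy from confusion counts and attaches a Wilson score interval to an observed proportion. The function names and the choice of the Wilson interval are assumptions made for illustration; the abstract does not say which statistical treatment the authors use.

```python
import math


def precision_recall_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence. This is an
    illustrative choice, not necessarily the paper's method.
    """
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Same observed accuracy (90%) on a small and a large test set.
    for correct, n in [(45, 50), (4500, 5000)]:
        low, high = wilson_interval(correct, n)
        print(f"n={n}: accuracy={correct / n:.2f}, interval=({low:.3f}, {high:.3f})")
```

With 45 correct decisions out of 50, the interval spans well over ten percentage points, so two components whose measured accuracies differ by only a few points cannot be ranked reliably on such a sample.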


International ACM SIGIR Conference on Research and Development in Information Retrieval | 1994

Using IR techniques for text classification in document analysis

Rainer Hoch

This paper presents the INFOCLAS system applying statistical methods of information retrieval for the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards the understanding of documents proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources including a letter database, word frequency statistics for German, lists of message type specific words, morphological knowledge as well as the underlying document structure. As output, the system evaluates a set of weighted hypotheses about the type of the actual letter. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.


International Journal on Document Analysis and Recognition | 1998

An experimental evaluation of OCR text representations for learning document classifiers

Markus Junker; Rainer Hoch

In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
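As a rough illustration of why n-gram representations tolerate OCR noise, the sketch below extracts padded character trigrams from a word and counts how many survive a single recognition error. The padding character, the lowercasing, and the helper names are assumptions for this example; the abstract does not specify how the seven representations are built.

```python
from collections import Counter


def char_ngrams(text, n=3, pad="_"):
    """Count character n-grams per whitespace token, padding word boundaries.

    Padding and lowercasing are illustrative choices, not taken from the paper.
    """
    grams = Counter()
    for token in text.lower().split():
        padded = pad + token + pad
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def shared_ngrams(a, b):
    """Number of n-gram occurrences common to both multisets."""
    return sum((a & b).values())


if __name__ == "__main__":
    correct = char_ngrams("Rechnung")  # German for "invoice"
    noisy = char_ngrams("Rechnumg")    # hypothetical OCR error: n read as m
    print(sorted(correct))
    print(shared_ngrams(correct, noisy), "of", sum(correct.values()), "trigrams survive")
```

A single-word representation would treat the misrecognized token as an entirely different feature, whereas most of its trigrams still match those of the correct word.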


International Conference on Document Analysis and Recognition | 1997

Evaluating OCR and non-OCR text representations for learning document classifiers

Markus Junker; Rainer Hoch

In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not been done yet. In order to investigate different text representations for document classification, we have developed a tool which transforms documents into feature-value representations that are suitable for standard learning algorithms. In this paper, we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.


International Journal of Pattern Recognition and Artificial Intelligence | 1996

On virtual partitioning of large dictionaries for contextual post-processing to improve character recognition

Rainer Hoch; Thomas Kieninger

This article presents a new approach to the partitioning of large dictionaries by virtual views. The basic idea is that additional knowledge sources of text recognition and text analysis are employed for fast dictionary look-up in order to prune search space through static or dynamic views. The heart of the system is a redundant hashing technique which involves a set of hash functions dealing with noisy input efficiently. Currently, the system is composed of two main system components: the dictionary generator and the dictionary controller. While the dictionary generator initially builds the system by using profiles and source dictionaries, the controller allows the flexible integration of different search heuristics. Results prove that our system achieves a significant speed-up of dictionary access time.


Archive | 1995

Document analysis at DFKI. - Part 2: Information extraction

Stephan Baumann; Michael Malburg; Hans-Günther Hein; Rainer Hoch; Thomas Kieninger; Norbert Kuhn

Document analysis has driven essential progress in office automation. This paper is part of an overview of the combined research efforts in document analysis at DFKI. Common to all document analysis projects is the global goal of providing a high-level electronic representation of documents in terms of iconic, structural, textual, and semantic information. These symbolic document descriptions enable “intelligent” access to a document database. Currently there are three ongoing document analysis projects at DFKI: INCA, OMEGA, and PASCAL2000/PASCAL+. Although the projects pursue different goals in different application domains, they all share the same problems, which have to be resolved with similar techniques. For that reason the activities in these projects are bundled to avoid redundant work. At DFKI we have divided the problem of document analysis into two main tasks, text recognition and information extraction, which themselves are divided into a set of subtasks. The work of the document analysis and office automation department at DFKI is presented in a series of three research reports. The first report discusses the problem of text recognition, the second that of information extraction. In a third report we describe our concept for a specialized knowledge representation language for document analysis. The report at hand describes the activities dealing with the information extraction task. Information extraction covers the phases of text analysis, message type identification, and file integration.


International Conference on Pattern Recognition | 1992

Fragmentary string matching by selective access to hybrid tries

Andreas Dengel; Adolf Pleyer; Rainer Hoch

The authors propose a dictionary look-up method as a contextual postprocessing for character hypotheses forming word candidates. In particular, a hybrid trie organization is combined with a selective-access-matrix (SAM) that allows an efficient matching of fragmentary input strings against legal words. Experiments prove that the method achieves some respectable results concerning speed. Furthermore, the additional memory needed for the SAM is smaller than the memory saved by the hybrid organization of the trie.
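The general idea of matching fragmentary strings against a dictionary can be sketched with an ordinary trie in which unrecognized character positions are marked by a wildcard. This is a simplified stand-in: it implements neither the hybrid trie organization nor the selective-access-matrix described in the abstract, and the class names, the `?` wildcard, and the sample words are assumptions for illustration.

```python
class TrieNode:
    """One node of a plain (non-hybrid) trie."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Dictionary of legal words supporting wildcard look-up."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def match(self, pattern, wildcard="?"):
        """Return all stored words matching the pattern.

        Each wildcard stands for exactly one unknown character,
        e.g. a position where character recognition was rejected.
        """
        results = []

        def walk(node, i, prefix):
            if i == len(pattern):
                if node.is_word:
                    results.append(prefix)
                return
            ch = pattern[i]
            if ch == wildcard:
                # Unknown character: follow every outgoing branch.
                for c, child in node.children.items():
                    walk(child, i + 1, prefix + c)
            elif ch in node.children:
                walk(node.children[ch], i + 1, prefix + ch)

        walk(self.root, 0, "")
        return sorted(results)


if __name__ == "__main__":
    lexicon = Trie(["offer", "order", "other", "enclosure"])
    print(lexicon.match("o??er"))
    print(lexicon.match("or?er"))
```

In this naive version every wildcard fans out over all child branches, which is the cost that a selective access structure would aim to reduce.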


International Conference on Pattern Recognition | 1994

Using a partitioned dictionary for contextual post-processing of OCR-results

Rainer Hoch; Hans-Günther Hein; Thomas Kieninger

This paper describes an approach for the partitioning of large dictionaries which can be used in document analysis. It introduces a concept of virtual views on the dictionary. The architecture of the dictionary system is based on redundant hashing techniques. The system distinguishes between the two main modules: the dictionary generator and the dictionary controller. Our tests comparing the dictionary with standard UNIX utilities show that dictionary look-up is very fast.


International Conference on Document Analysis and Recognition | 1995

READLEX: a lexicon for the recognition and analysis of structured documents

Rainer Hoch

This paper describes the architecture of a lexicon system called READLEX dealing with requirements of both text recognition and text analysis in document analysis. In order to meet these requirements, we have developed a concept for the automatic acquisition and generation of the lexicon. The heart of the lexicon system is based on redundant hash addressing techniques. Currently, the lexicon is used for the contextual post-processing of OCR results as well as the categorization of texts within structured documents. Other components for document analysis such as the address parser and a text pattern matcher also make use of the lexicon.


Archive | 1997

Techniques for improving OCR results

Andreas Dengel; Rainer Hoch; M. M. Abraham; Zhang Guozhen; Michael Malburg; Achim Weigel
