Hinrich Schütze | Researchain

Publication

Featured research published by Hinrich Schütze.

Archive | 2008

Introduction to Information Retrieval: Scoring, term weighting, and the vector space model

Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze

Thus far, we have dealt with indexes that support Boolean queries: A document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter, we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas. We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata, such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query. Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term. In Section 6.3, we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring. Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring and related topics.
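The vector space view described in this abstract can be illustrated with a small sketch. It assumes one common weighting, raw term frequency times log10(N/df), and cosine similarity as the score; the abstract itself does not fix these choices, and the function names (`tf_idf_vectors`, `cosine`, `rank`) are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per tokenized document.

    Assumption: weight = raw term frequency * log10(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log10(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query and sort by descending score."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


docs = [
    ["car", "insurance", "auto"],
    ["best", "car"],
    ["auto", "insurance", "insurance"],
]
ranking = rank(["auto", "insurance"], docs)
```

Here `ranking` lists `(score, document index)` pairs, so the document that repeats a query term is ranked ahead of the one that mentions each term once, and the document sharing no query terms scores zero.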

International Conference on Computational Linguistics | 2008

Stopping Criteria for Active Learning of Named Entity Recognition

Florian Laws; Hinrich Schütze

Active learning is a proven method for reducing the cost of creating the training sets that are necessary for statistical NLP. However, there has been little work on stopping criteria for active learning. An operational stopping criterion is necessary to be able to use active learning in NLP applications. We investigate three different stopping criteria for active learning of named entity recognition (NER) and show that one of them, gradient-based stopping, (i) reliably stops active learning, (ii) achieves near-optimal NER performance, and (iii) needs only about 20% as much training data as exhaustive labeling.

Cognitive Science | 2010

Multilevel Exemplar Theory

Michael Walsh; Bernd Möbius; Travis Wade; Hinrich Schütze

This paper presents recent research that provides an overarching model of exemplar theory capable of explaining phenomena across the phonetic and syntactic strata. The model represents a unique exemplar-based account of constituency interactions encompassing both linguistic domains. It yields simulation and experimental results in keeping with experimental findings in the literature on syllable duration variability and offers an exemplar-theoretic account of local grammaticality. In addition, it provides some insights into the nature of exemplar cloud formation and demonstrates experimentally the potential gains that can be enjoyed via the use of rich exemplar representations.

Conference on Information and Knowledge Management | 2006

Performance thresholding in practical text classification

Hinrich Schütze; Emre Velipasaoglu; Jan O. Pedersen

In practical classification, there is often a mix of learnable and unlearnable classes and only a classifier above a minimum performance threshold can be deployed. This problem is exacerbated if the training set is created by active learning. The bias of actively learned training sets makes it hard to determine whether a class has been learned. We give evidence that there is no general and efficient method for reducing the bias and correctly identifying classes that have been learned. However, we characterize a number of scenarios where active learning can succeed despite these difficulties.

Computational Linguistics | 2007

Prepositional phrase attachment without oracles

Michaela Atterer; Hinrich Schütze

Work on prepositional phrase (PP) attachment resolution generally assumes that there is an oracle that provides the two hypothesized structures that we want to choose between. The information that there are two possible attachment sites and the information about the lexical heads of those phrases is usually extracted from gold-standard parse trees. We show that the performance of reattachment methods is higher with such an oracle than without. Because oracles are not available in NLP applications, this indicates that the current evaluation methodology for PP attachment does not produce realistic performance numbers. We argue that PP attachment should not be evaluated in isolation, but instead as an integral component of a parsing system, without using information from the gold-standard oracle.

Corpus Linguistics and Linguistic Theory | 2011

Asymmetry in corpus-derived and human word associations

Lukas Michelbacher; Stefan Evert; Hinrich Schütze

We investigate asymmetry in corpus-derived and human word associations. Most prior work has studied paradigmatic relations, either derived from free association norms or from large corpora using measures of statistical association and semantic relatedness. By contrast, we investigate the syntagmatic relation between words in adjective-noun and noun-noun combinations and present a new experimental design for measuring the strength of human associations. Of particular importance for syntagmatic relations are asymmetric associations, whose associational strength is much larger in one direction (e.g., from Pyrrhic to victory) than in the other (e.g., from victory to Pyrrhic). We develop a number of corpus-derived measures of asymmetric association and show that they predict the directedness of human associations with high accuracy.

arXiv: Computation and Language | 2016

Attention-Based Convolutional Neural Network for Machine Comprehension

Wenpeng Yin; Sebastian Ebert; Hinrich Schütze

Understanding open-domain text is one of the primary challenges in natural language processing (NLP). Machine comprehension benchmarks evaluate a system's ability to understand text based on the text content only. In this work, we investigate machine comprehension on MCTest, a question answering (QA) benchmark. Prior work is mainly based on feature engineering approaches. We propose a neural network framework, named hierarchical attention-based convolutional neural network (HABCNN), to address this task without any manually designed features. Specifically, we explore HABCNN for this task by two routes: one is through traditional joint modeling of passage, question and answer; the other is through textual entailment. HABCNN employs an attention mechanism to detect key phrases, key sentences and key snippets that are relevant to answering the question. Experiments show that HABCNN outperforms prior deep learning approaches by a big margin.

Patent Information Retrieval | 2010

Preliminary study into query translation for patent retrieval

Charles Jochim; Christina Lioma; Hinrich Schütze; Steffen Koch; Thomas Ertl

Patent retrieval is a branch of Information Retrieval (IR) aiming to support patent professionals in retrieving patents that satisfy their information needs. Often, patent granting bodies require patents to be partially translated into one or more major foreign languages, so that language boundaries do not hinder their accessibility. This multilinguality of patent collections offers opportunities for improving patent retrieval. In this work we exploit these opportunities by applying query translation to patent retrieval. We expand monolingual patent queries with their translations, using both a domain-specific patent dictionary that we extract from the patent collection, and a general domain-free dictionary. Experimental evaluation on a standard CLEF-IP dataset shows that using either translation dictionary yields similar results: query translation can help patent retrieval, but not always, and without great improvement compared to standard statistical monolingual query expansion (Rocchio). The improvement is greater when the source language is English, as opposed to French or German, a finding partly due to the effect of the complex French and German morphology upon translation accuracy, but also partly due to the prevalence of English in the collection. A thorough per-query analysis reveals that cases where standard query expansion fails (e.g., zero recall) can benefit from query translation.

Archive | 2008

Introduction to Information Retrieval: Text classification and Naive Bayes

Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time. If your standing query is just multicore AND computer AND chip, you will tend to miss many relevant new articles which use other terms such as multicore processors. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore OR multi-core) AND (chip OR processor OR microprocessor). To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to.
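The refined standing query in the example above can be sketched as a conjunction of OR-clauses that is re-run on each batch of new documents. This is only an illustration: the plural-stripping `normalize` function is a crude stand-in for a real stemmer, and all names here are hypothetical.

```python
import re

# (multicore OR multi-core) AND (chip OR processor OR microprocessor),
# written as a list of OR-clauses that must all be satisfied.
STANDING_QUERY = [
    {"multicore", "multi-core"},
    {"chip", "processor", "microprocessor"},
]


def normalize(token):
    """Toy stand-in for stemming: strip one trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def terms(text):
    """Lowercase, tokenize (keeping hyphens), and normalize a document."""
    return {normalize(tok) for tok in re.findall(r"[a-z0-9-]+", text.lower())}


def matches(query, text):
    """True if every OR-clause shares at least one term with the document."""
    doc_terms = terms(text)
    return all(clause & doc_terms for clause in query)


def run_standing_query(query, new_docs):
    """Return the newly added documents that satisfy the standing query."""
    return [doc for doc in new_docs if matches(query, doc)]


todays_articles = [
    "New multicore processors announced",
    "Multi-core chip sales rise",
    "Potato chips recalled",
]
hits = run_standing_query(STANDING_QUERY, todays_articles)
```

With these three sample articles, the first two satisfy both clauses, while the third matches only the second clause and is filtered out.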

Archive | 2008

Introduction to Information Retrieval: Evaluation in information retrieval

Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze

We have seen in the preceding chapters many alternatives in designing an information retrieval (IR) system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? IR has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections. In this chapter, we begin with a discussion of measuring the effectiveness of IR systems (Section 8.1) and the test collections that are most often used for this purpose (Section 8.2). We then present the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results (Section 8.3). This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification and why they are appropriate. We then extend these notions and develop further measures for evaluating ranked retrieval results (Section 8.4) and discuss developing reliable and informative test collections (Section 8.5). We then step back to introduce the notion of user utility, and how it is approximated by the use of document relevance (Section 8.6). The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness.
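For unranked retrieval results judged as relevant or nonrelevant, the standard set-based measures are precision and recall, often combined into the balanced F measure. The sketch below is a minimal illustration with a hypothetical function name and made-up document IDs, not code from the book.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F1 for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative document IDs: four retrieved, three judged relevant.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

In this example two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3).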
