Holger Bast | Researchain

Publication

Featured research published by Holger Bast.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006

Type less, find more: fast autocompletion search with a succinct index

Holger Bast; Ingmar Weber

We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a state-of-the-art compressed inverted index, but with 10 times faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a full-fledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semi-structured (XML) text, subword and phrase completion, and semantic tags.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

ESTER: efficient search on text, entities, and relations

Holger Bast; Alexandru Chitea; Fabian M. Suchanek; Ingmar Weber

We present ESTER, a modular and highly efficient system for combined full-text and ontology search. ESTER builds on a query engine that supports two basic operations: prefix search and join. Both of these can be implemented very efficiently with a compact index, yet in combination provide powerful querying capabilities. We show how ESTER can answer basic SPARQL graph-pattern queries on the ontology by reducing them to a small number of these two basic operations. ESTER further supports a natural blend of such semantic queries with ordinary full-text queries. Moreover, the prefix search operation allows for a fully interactive and proactive user interface, which after every keystroke suggests to the user possible semantic interpretations of his or her query, and speculatively executes the most likely of these interpretations. As a proof of concept, we applied ESTER to the English Wikipedia, which contains about 3 million documents, combined with the recent YAGO ontology, which contains about 2.5 million facts. For a variety of complex queries, ESTER achieves worst-case query processing times of a fraction of a second, on a single machine, with an index size of about 4 GB.

Very Large Data Bases | 2008

TopX: efficient and versatile top-k query processing for semistructured data

Martin Theobald; Holger Bast; Debapriyo Majumdar; Ralf Schenkel; Gerhard Weikum

Recent IR extensions to XML query languages such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the k top-ranked result elements according to a monotonic score aggregation function with respect to a multidimensional query. It efficiently supports vague search on both content- and structure-oriented query conditions for dynamic query relaxation with controllable influence on the result ranking. The main contributions of this paper unfold into four main points: (1) fully implemented models and algorithms for ranked XML retrieval with XPath Full-Text functionality, (2) efficient and effective top-k query processing for semistructured data, (3) support for integrating thesauri and ontologies with statistically quantified relationships among concepts, leveraged for word-sense disambiguation and query expansion, and (4) a comprehensive description of the TopX system, with performance experiments on large-scale corpora like TREC Terabyte and INEX Wikipedia.
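
The early-termination idea in this abstract can be illustrated with a generic threshold-style top-k sketch. This is not the TopX implementation: it assumes one in-memory score list per query dimension, non-negative scores, a missing score of 0, and summation as the monotonic aggregation function. The name `top_k_threshold` is hypothetical.

```python
import heapq


def top_k_threshold(lists, k):
    """Threshold-style top-k over several score lists (illustrative sketch).

    lists: one dict per query dimension, mapping item -> non-negative score.
    Returns (top-k list of (item, total score), number of rounds read).
    """
    # Each dimension is read in descending score order ("sorted access").
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                item, score = s[depth]
                threshold += score
                if item not in seen:
                    # "Random access": look up the item in every dimension.
                    seen[item] = sum(l.get(item, 0.0) for l in lists)
            # An exhausted list adds 0: unseen items have no score there.
        depth += 1
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so once k seen
        # items reach it, the top-k result is safe and we can stop.
        if len(best) == k and best[-1][1] >= threshold:
            return best, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth
```

With two lists of four items each, the sketch can stop after reading only part of each list, because the threshold bounds the score of anything not yet seen.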

ACM Symposium on Parallel Algorithms and Architectures | 1991

Fast and reliable parallel hashing

Holger Bast; Torben Hagerup

A perfect integers is an ... Holger Bast, Fachbereich Informatik, Universität des Saarlandes, W-6600 Saarbrücken, Germany

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005

Why spectral retrieval works

Holger Bast; Debapriyo Majumdar

We argue that the ability to identify pairs of related terms is at the heart of what makes spectral retrieval work in practice. Schemes such as latent semantic indexing (LSI) and its descendants have this ability in the sense that they can be viewed as computing a matrix of term-term relatedness scores which is then used to expand the given documents (not the queries). For almost all existing spectral retrieval schemes, this matrix of relatedness scores depends on a fixed low-dimensional subspace of the original term space. We instead vary the dimension and study for each term pair the resulting curve of relatedness scores. We find that it is actually the shape of this curve which is indicative of the term-pair relatedness, and not any of the individual relatedness scores on the curve. We derive two simple, parameterless algorithms that detect this shape and that consistently outperform previous methods on a number of test collections. Our curves also shed light on the effectiveness of three fundamental types of variations of the basic LSI scheme.

Conference on Information and Knowledge Management | 2007

Efficient interactive query expansion with complete search

Holger Bast; Debapriyo Majumdar; Ingmar Weber

We present an efficient realization of the following interactive search engine feature: as the user is typing the query, words that are related to the last query word and that would lead to good hits are suggested, as well as selected such hits. The realization has three parts: (i) building clusters of related terms, (ii) adding this information as artificial words to the index such that (iii) the described feature reduces to an instance of prefix search and completion. An efficient solution for the latter is provided by the CompleteSearch engine, with which we have integrated the proposed feature. For building the clusters of related terms we propose a variant of latent semantic indexing that, unlike standard approaches, is completely transparent to the user. By experiments on two large test-collections, we demonstrate that the feature is provided at only a slight increase in query processing time and index size.
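
The reduction described above (cluster information stored as artificial index words, so that suggesting related terms becomes a prefix search) can be sketched roughly as follows. This is an assumption-laden illustration, not the CompleteSearch implementation: the `cluster:<id>:<term>` word format, the function names, and the in-memory dictionaries are all hypothetical.

```python
from collections import defaultdict


def build_index(docs, clusters):
    """Build an inverted index that also holds artificial cluster words.

    docs: dict doc_id -> text; clusters: dict cluster_id -> set of terms.
    The 'cluster:<id>:<term>' format is a made-up convention for this sketch.
    """
    term_to_cluster = {t: cid for cid, terms in clusters.items() for t in terms}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
            cid = term_to_cluster.get(term)
            if cid is not None:
                # Artificial word: lets a prefix query enumerate the cluster.
                index[f"cluster:{cid}:{term}"].add(doc_id)
    return index


def related_completions(index, clusters, last_word):
    """Return related terms of last_word with their documents via a prefix scan."""
    cid = next((c for c, terms in clusters.items() if last_word in terms), None)
    if cid is None:
        return {}
    prefix = f"cluster:{cid}:"
    return {
        word[len(prefix):]: sorted(doc_ids)
        for word, doc_ids in index.items()
        if word.startswith(prefix)
    }
```

A real engine would answer the prefix query from a sorted or compressed index rather than scanning every word; the linear scan here only keeps the sketch short.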

Mathematical Foundations of Computer Science | 1992

A Perfect Parallel Dictionary

Holger Bast; Martin Dietzfelbinger; Torben Hagerup

We describe new randomized parallel algorithms for the problems of interval allocation, construction of static dictionaries, and maintenance of dynamic dictionaries. All of our algorithms run optimally in constant time with high probability. Our main result is the construction of what we call a perfect dictionary, a scheme that allows p processors implementing a set M in space proportional to |M| to process batches of p insert, delete, and lookup instructions on M in constant time per batch.

String Processing and Information Retrieval | 2006

Output-sensitive autocompletion search

Holger Bast; Christian Mortensen; Ingmar Weber

We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every keystroke display those completions of the last query word that would lead to the best hits, and also display the best such hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents, and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w ∈ W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on the average, in time linear in the input plus output size, independent of the size of the underlying document collection. At the same time, our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate almost perfectly with our theoretical bound.
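
To make the core problem concrete, here is a minimal sketch of the straightforward inverted-index baseline, not the data structure proposed in the paper. It assumes the index is a plain dictionary from word to document ids; the function names and the use of U+FFFF as an upper sentinel for prefix ranges are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def autocomplete_pairs(index, docs, lo, hi):
    """Return all (w, d) with lo <= w <= hi and d in docs.

    index: dict mapping each word to a sorted list of document ids.
    Baseline only: it reads the full posting list of every word in the
    range, so its cost is not bounded by the input plus output size.
    """
    vocabulary = sorted(index)  # a real index would keep this precomputed
    start = bisect_left(vocabulary, lo)
    end = bisect_right(vocabulary, hi)
    doc_set = set(docs)
    return [
        (w, d)
        for w in vocabulary[start:end]
        for d in index[w]
        if d in doc_set
    ]


def prefix_range(prefix):
    """Word range covering all words that start with prefix.

    Assumes no indexed word contains the character U+FFFF.
    """
    return prefix, prefix + "\uffff"
```

For example, D could be the documents matching the earlier query words and W the range of all words starting with the prefix typed so far; the baseline then scans postings of every candidate completion, which is the cost the paper's output-sensitive bound avoids.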

EWMF'05/KDO'05 Proceedings of the 2005 joint international conference on Semantics, Web and Mining | 2005

Discovering a term taxonomy from term similarities using principal component analysis

Holger Bast; Georges Dupret; Debapriyo Majumdar; Benjamin Piwowarski

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were only known to be useful for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations of a given pair of terms of a given collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term being more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion, by showing that in a simplified mathematical model the criterion does the apparently right thing. We applied our scheme to the reconstruction of a selected part of the open directory project (ODP) hierarchy, with promising results.

International Workshop on Challenges in Web Information Retrieval and Integration | 2005

Insights from Viewing Ranked Retrieval as Rank Aggregation

Holger Bast; Ingmar Weber

We view a variety of established methods for ranked retrieval from a common angle, namely as a process of combining query-independent rankings that were precomputed for certain attributes. Apart from a general insight into what effectively distinguishes various schemes from each other, we obtain three specific results concerned with concept-based retrieval. First, we prove that latent semantic indexing (LSI) can be implemented to answer queries in time proportional to the number of words in the query, which improves over the standard implementation by an order of magnitude; a similar result is established for LSI's probabilistic sibling PLSI. Second, we give a simple and precise characterization of the extent to which latent semantic indexing (LSI) can deal with polysems, and when it fails to do so. Third, we demonstrate that the recombination of the intricate, yet relatively cheap mechanism of PLSI for mapping queries to attributes, with a simplistic, easy-to-compute set of document rankings gives a retrieval performance which is at least as good as that of the most sophisticated concept-based retrieval schemes and which does not require any precomputation.
