Andreas Broschart
Max Planck Society
Publication
Featured research published by Andreas Broschart.
String Processing and Information Retrieval | 2007
Ralf Schenkel; Andreas Broschart; Seung-won Hwang; Martin Theobald; Gerhard Weikum
In addition to purely occurrence-based relevance models, term proximity has frequently been used to enhance the retrieval quality of keyword-oriented retrieval systems. While there has been work on effective scoring functions that incorporate proximity, there has not been much work on algorithms or access methods for their efficient evaluation. This paper presents an efficient evaluation framework including a proximity scoring function integrated within a top-k query engine for text retrieval. We propose precomputed and materialized index structures that boost performance. The increased retrieval effectiveness and efficiency of our framework are demonstrated through extensive experiments on a very large text benchmark collection. In combination with static index pruning for the proximity lists, our algorithm achieves an improvement of two orders of magnitude compared to a term-based top-k evaluation, with significantly improved result quality.
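To make the idea of a proximity score concrete, here is a minimal sketch. It assumes one common form of proximity scoring in which every pair of occurrences of two different query terms contributes 1/d², where d is their distance in tokens. The function names and the input layout (a mapping from term to token positions) are hypothetical, and this is not necessarily the scoring function used in the paper.

```python
from itertools import combinations


def pair_proximity(positions_a, positions_b):
    """Sum 1/d^2 over all occurrence pairs of two terms in one document.

    Assumed illustration only: d is the token distance between an
    occurrence of the first term and an occurrence of the second.
    """
    total = 0.0
    for pa in positions_a:
        for pb in positions_b:
            d = abs(pa - pb)
            if d > 0:  # two terms cannot share a position, but guard anyway
                total += 1.0 / (d * d)
    return total


def document_proximity(term_positions):
    """Add up the pairwise proximity of all distinct query-term pairs.

    term_positions maps each query term to its token positions in the
    document (hypothetical input layout).
    """
    return sum(
        pair_proximity(term_positions[t1], term_positions[t2])
        for t1, t2 in combinations(sorted(term_positions), 2)
    )


example = {"proximity": [1], "top": [2], "retrieval": [4]}
score = document_proximity(example)
```

Because the pairwise contributions depend only on the document and the term pair, they could in principle be computed at indexing time and stored per pair, which is the kind of precomputation the abstract refers to.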
ACM Transactions on Information Systems | 2012
Andreas Broschart; Ralf Schenkel
Term proximity scoring is an established means in information retrieval for improving the result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents in which tuples of terms, usually pairs, occur together, which usually incurs a huge index size compared to term-only indexes. This article introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance under controlled result quality, given an upper bound for the index size. The framework allows lists for pairs to be selectively materialized based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of more than one order of magnitude over existing text-based processing techniques with reasonable index sizes.
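The two size-reduction ideas in this abstract, keeping only pair lists that a query log justifies and cutting each kept list short, can be sketched as follows. This is an illustration under assumed names and parameters (`min_count`, `max_len`, and the list layout are hypothetical); the article's actual tuning procedure chooses its parameters by optimization and is not reproduced here.

```python
def prune_index(pair_lists, log_counts, min_count, max_len):
    """Keep pair lists seen often enough in a query log, truncated by score.

    pair_lists: dict mapping a term pair to a list of (doc_id, score).
    log_counts: dict mapping a term pair to its frequency in a query log.
    min_count, max_len: hypothetical pruning parameters.
    """
    pruned = {}
    for pair, postings in pair_lists.items():
        if log_counts.get(pair, 0) < min_count:
            continue  # pair is too rare in the log to be worth materializing
        by_score = sorted(postings, key=lambda entry: entry[1], reverse=True)
        pruned[pair] = by_score[:max_len]
    return pruned


def index_size(index):
    """Measure index size as the total number of stored entries."""
    return sum(len(postings) for postings in index.values())


pair_lists = {
    ("a", "b"): [(1, 0.5), (2, 0.9), (3, 0.1)],
    ("a", "c"): [(1, 0.3)],
}
log_counts = {("a", "b"): 5}
pruned = prune_index(pair_lists, log_counts, min_count=2, max_len=2)
```

A tuning loop could call `prune_index` with different parameter settings and keep the best-performing setting whose `index_size` stays below the given upper bound.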
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008
Andreas Broschart; Ralf Schenkel
Proximity-aware scoring functions lead to significant effectiveness improvements for text retrieval. For XML IR, we can sometimes enhance the retrieval quality by exploiting knowledge about the document structure combined with established text IR methods. This paper introduces modified proximity scores that take the document structure into account and demonstrates the effect for the INEX benchmark.
Advances in Focused Retrieval | 2009
Andreas Broschart; Ralf Schenkel; Martin Theobald
Proximity-enhanced scoring models significantly improve retrieval quality in text retrieval. For XML IR, we can sometimes enhance retrieval efficacy by exploiting knowledge about the document structure combined with established text IR methods. This paper elaborates on our approach used for INEX 2008, which modifies a proximity scoring model from text retrieval for use in XML IR and extends it by taking the document structure information into account.
European Conference on Information Retrieval | 2010
Andreas Broschart; Klaus Berberich; Ralf Schenkel
This paper evaluates the potential impact of explicit phrases on retrieval quality through a case study with the TREC Terabyte benchmark. It compares the performance of user- and system-identified phrases with a standard score and a proximity-aware score, and shows that an optimal choice of phrases, including term permutations, can significantly improve query performance.
INEX '09: Proceedings of the Focused Retrieval and Evaluation, and 8th International Conference on Initiative for the Evaluation of XML Retrieval | 2009
Andreas Broschart; Ralf Schenkel
Scoring models that make use of proximity information usually improve result quality in text retrieval. Because index structures carrying proximity information can grow very large if they are not pruned, it is helpful to tune indexes with respect to space requirements and retrieval quality. This paper elaborates on our approach used for INEX 2009 to tune index structures for different choices of result size k. Our best tuned index structures provide the best CPU times for Type A queries among the Efficiency Track participants while still providing at least BM25 retrieval quality. Due to their number of query terms, Type B queries cannot be processed as efficiently. To allow a comparison of retrieval quality with non-pruned index structures, we also report our results from the Adhoc Track.
Focused Access to XML Documents | 2008
Andreas Broschart; Ralf Schenkel; Martin Theobald; Gerhard Weikum
This paper describes the setup and results of the Max-Planck-Institut für Informatik's contributions for the INEX 2007 AdHoc Track task. The runs were produced with TopX, a search engine for ranked retrieval of XML data that supports a probabilistic scoring model for full-text content conditions and tag-term combinations, path conditions as exact or relaxable constraints, and ontology-based relaxation of terms and tag names.
International Workshop of the Initiative for the Evaluation of XML Retrieval | 2006
Martin Theobald; Andreas Broschart; Ralf Schenkel; Silvana Solomon; Gerhard Weikum
This paper describes the setup and results of the Max-Planck-Institut für Informatik’s contributions for the INEX 2006 AdHoc Track and Feedback task. The runs were produced with the TopX system, which is a top-k retrieval engine for text and XML data that uses a combination of BM25-based content and structural scores.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011
Andreas Broschart; Ralf Schenkel
Query processing with precomputed term pair lists can improve efficiency for some queries, but suffers from the quadratic number of index lists that need to be read. We present a novel hybrid index structure that aims at decreasing the number of index lists retrieved at query processing time, trading off a reduced number of index lists for an increased number of bytes to read. Our experiments demonstrate significant cold-cache performance gains of almost 25% on standard benchmark queries.
Archive | 2012
Andreas Broschart; Ralf Schenkel; Torsten Suel
In the presence of growing data, efficient query processing under result quality and index size control is increasingly a challenge for search engines. We show how to use proximity scores to make query processing effective and efficient, with a focus on either of the optimization goals. More precisely, we make the following contributions:
• We present a comprehensive comparative analysis of proximity score models and a rigorous analysis of the potential of phrases, and adapt a leading proximity score model for XML data.
• We discuss the feasibility of all presented proximity score models for top-k query processing and present a novel index combining a content and proximity score that helps to accelerate top-k query processing and improves result quality.
• We present a novel, distributed index tuning framework for term and term pair index lists that optimizes pruning parameters by means of well-defined optimization criteria under disk space constraints. Indexes can be tuned with emphasis on efficiency or effectiveness: the resulting indexes yield fast processing at high result quality.
• We show that pruned index lists processed with a merge join outperform top-k query processing with unpruned lists at a high result quality.
• Moreover, we present a hybrid index structure for improved cold-cache run times.
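The merge join over pruned index lists named in the contributions above can be illustrated with a minimal sketch. It assumes that every pruned list is sorted by document ID and that a document's score is the sum of its entries across lists. The function name and list layout are hypothetical and do not reflect the thesis implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_join_topk(lists, k):
    """Merge doc-ID-sorted lists, sum scores per document, return the top k.

    lists: iterable of lists of (doc_id, score), each sorted by doc_id
    (assumed layout). Returns up to k (doc_id, total_score) pairs,
    highest total first.
    """
    # heapq.merge streams the sorted inputs as one doc-ID-sorted sequence.
    merged = heapq.merge(*lists, key=itemgetter(0))
    # Entries for the same document are now adjacent, so groupby can sum them.
    totals = (
        (doc_id, sum(score for _, score in group))
        for doc_id, group in groupby(merged, key=itemgetter(0))
    )
    return heapq.nlargest(k, totals, key=itemgetter(1))


term_list = [(1, 0.25), (3, 0.5)]
pair_list = [(1, 0.5), (2, 0.125), (3, 0.125)]
top = merge_join_topk([term_list, pair_list], k=2)
```

Unlike a threshold-based top-k algorithm, this sketch reads every list completely. That is affordable only when the lists are short, which is why it pairs naturally with pruning.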