Arnd Christian König


Publication

Featured research published by Arnd Christian König.

very large data bases | 2011

Fast set intersection in memory

Bolin Ding; Arnd Christian König

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear-space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we show how to compute their intersection in expected time [EQUATION], where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform state-of-the-art techniques on both synthetic and real data sets and workloads.
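
As a rough illustration of how word-sized signatures can speed up intersection, the following sketch partitions each set into buckets and stores one 64-bit signature per bucket; a single bitwise AND then rules out bucket pairs that cannot share an element. This is a simplified sketch and not the paper's data structure: the names `preprocess` and `intersect`, the hash `_h`, the modulo bucketing, and the assumption of non-negative integer elements are all illustrative choices.

```python
W_BITS = 6          # assumed: a 64-bit machine word, so 6 bits select a position
W = 1 << W_BITS


def _h(x: int) -> int:
    """Illustrative multiplicative hash: map x to a bit position in [0, W)."""
    return ((x * 2654435761) & 0xFFFFFFFF) >> (32 - W_BITS)


def preprocess(s, num_buckets):
    """Split a set of non-negative ints into buckets, each with a W-bit signature."""
    buckets = [[0, []] for _ in range(num_buckets)]
    for x in s:
        bucket = buckets[x % num_buckets]
        bucket[0] |= 1 << _h(x)   # record which hash positions occur in this bucket
        bucket[1].append(x)
    return buckets


def intersect(a, b):
    """Intersect two preprocessed sets built with the same num_buckets."""
    out = []
    for (sig_a, elems_a), (sig_b, elems_b) in zip(a, b):
        common = sig_a & sig_b
        if common == 0:
            continue              # one word-AND shows this bucket pair is disjoint
        # Only elements whose hash bit survives the AND can be shared.
        candidates = {x for x in elems_b if (common >> _h(x)) & 1}
        out.extend(x for x in elems_a if x in candidates)
    return sorted(out)


a = preprocess({1, 5, 9, 12}, num_buckets=4)
b = preprocess({5, 12, 40}, num_buckets=4)
print(intersect(a, b))  # [5, 12]
```

Both sets must be preprocessed with the same bucket count so that a shared element always lands in the same bucket index on both sides.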

international world wide web conferences | 2009

Exploiting web search engines to search structured databases

Sanjay Agrawal; Kaushik Chakrabarti; Surajit Chaudhuri; Venkatesh Ganti; Arnd Christian König; Dong Xin

Web search engines often federate many user queries to relevant structured databases. For example, a product-related query might be federated to a product database containing product descriptions and specifications. The relevant structured data items are then returned to the user along with web search results. However, each structured database is searched in isolation. Hence, the search often produces empty or incomplete results, as the database may not contain the information required to answer the query. In this paper, we propose a novel integrated search architecture. We establish and exploit the relationships between web search results and the items in structured databases to identify the relevant structured data items for a much wider range of queries. Our architecture leverages existing search engine components to implement this functionality at very low overhead. We demonstrate the quality and efficiency of our techniques through an extensive experimental study.

Communications of The ACM | 2011

Theory and applications of b-bit minwise hashing

Ping Li; Arnd Christian König

Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide order-of-magnitude improvements in storage requirements and computational overhead over the original scheme in practice. We give both theoretical characterizations of the performance of the new algorithm and a practical evaluation on large real-life datasets, and show that the two match very closely. Moreover, we provide a detailed comparison to other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.
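
The idea of keeping only the lowest b bits of each minwise hash can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, seeded BLAKE2b hashes stand in for random permutations, and the estimator uses the simplifying assumption that both sets are tiny relative to the hash space, so two different minimum values agree on their lowest b bits with probability about 1/2^b.

```python
import hashlib


def _hash(x: int, seed: int) -> int:
    """Seeded 64-bit hash, used here as a stand-in for a random permutation."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bbit_signature(s, k: int, b: int):
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash(x, seed) for x in s) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Estimate the Jaccard resemblance from two b-bit signatures.

    Assumes sparse sets: the match rate is roughly c + (1 - c) * R with
    c = 1 / 2**b, so we invert that relation to recover R.
    """
    match_rate = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 1.0 / (1 << b)
    return (match_rate - c) / (1.0 - c)
```

With b = 2, each of the k samples costs 2 bits instead of a full 64-bit hash value; the price is higher variance per sample, which is typically offset by using a larger k.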

very large data bases | 2012

Robust estimation of resource consumption for SQL queries using statistical techniques

Jiexing Li; Arnd Christian König; Vivek R. Narasayya; Surajit Chaudhuri

The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system, such as admission control, query scheduling, and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require examples of resource usage in queries as training data, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large-scale real-life and benchmark workloads on Microsoft SQL Server.

international acm sigir conference on research and development in information retrieval | 2007

Heavy-tailed distributions and multi-keyword queries

Surajit Chaudhuri; Kenneth Ward Church; Arnd Christian König; Liying Sui

Intersecting inverted indexes is a fundamental operation for many applications in information retrieval and databases. Efficient indexing for this operation is known to be a hard problem for arbitrary data distributions. However, text corpora used in information retrieval applications often have convenient power-law constraints (also known as Zipf's law and long tails) that allow us to materialize carefully chosen combinations of multi-keyword indexes, which significantly improve worst-case performance without requiring excessive storage. These multi-keyword indexes limit the number of postings accessed when computing arbitrary index intersections. Our evaluation on an e-commerce collection of 20 million products shows that the indexes of up to four arbitrary keywords can be intersected while accessing less than 20% of the postings in the largest single-keyword index.

international conference on data engineering | 2004

SQLCM: a continuous monitoring framework for relational database engines

Surajit Chaudhuri; Arnd Christian König; Vivek R. Narasayya

The ability to monitor a database server is crucial for effective database administration. Today's commercial database systems support two basic mechanisms for monitoring: (a) obtaining a snapshot of counters to capture current state, and (b) logging events in the server to a table/file to capture history. We show that for a large class of important database administration tasks the above mechanisms are inadequate in functionality or performance. We present an infrastructure called SQLCM that enables continuous monitoring inside the database server and that has the ability to automatically take actions based on monitoring. We describe the implementation of SQLCM in Microsoft SQL Server and show how several common and important monitoring tasks can be easily specified in SQLCM. Our experimental evaluation indicates that SQLCM imposes low overhead on normal server execution and enables monitoring tasks on a production server that would be too expensive using today's monitoring mechanisms.

knowledge discovery and data mining | 2008

Entity categorization over large document collections

Venkatesh Ganti; Arnd Christian König; Rares Vernica

Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task only analyzed the local document context within which entities occur. In this paper, we significantly improve the accuracy of entity categorization by (i) considering an entity's context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.

web search and data mining | 2010

Precomputing search features for fast and accurate query classification

Venkatesh Ganti; Arnd Christian König; Xiao Li

Query intent classification is crucial for web search and advertising. It is known to be challenging because web queries contain fewer than three words on average, and so provide little signal on which to base classification decisions. At the same time, the vocabulary used in search queries is vast: thus, classifiers based on word occurrence have to deal with a very sparse feature space, and often require large amounts of training data. Prior efforts to address the issue of feature sparseness augmented the feature space using features computed from the results obtained by issuing the query to be classified against a web search engine. However, these approaches induce high latency, making them unacceptable in practice. In this paper, we propose a new class of features that realizes the benefit of search-based features without high latency. These leverage co-occurrence between the query keywords and tags applied to documents in search results, resulting in a significant boost to web query classification accuracy. By pre-computing the tag incidence for a suitably chosen set of keyword combinations, we are able to generate the features online with low latency and memory requirements. We evaluate the accuracy of our approach using a large corpus of real web queries in the context of commercial search.

international conference on data engineering | 2009

A Data Structure for Sponsored Search

Arnd Christian König; Kenneth Ward Church; Martin Markov

Inverted files have been very successful for document retrieval, but sponsored search is different. Inverted files are designed to find documents that match the query (all the terms in the query need to be in the document, but not vice versa). For sponsored search, ads are associated with bids. When a user issues a search query, bids are typically matched to the query using broad-match semantics: all the terms in the bid need to be in the query (but not vice versa). This means that the roles of the query and the bid/document are reversed in sponsored search, in turn making standard retrieval techniques based on inverted indexes ill-suited for sponsored search. This paper proposes novel index structures and query processing algorithms for sponsored search. We evaluate these structures using a real corpus of 180 million advertisements.
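
The reversed roles under broad-match semantics can be made concrete with a small sketch: bids are indexed by their full term set, and a query is answered by looking up every non-empty subset of its terms. This is a naive illustration under stated assumptions, not the index structure proposed in the paper; `build_bid_index`, `broad_match`, and the sample ads are hypothetical, and subset enumeration is exponential in query length, which is tolerable only because queries are short.

```python
from collections import defaultdict
from itertools import combinations


def build_bid_index(bids):
    """Map each bid's term set to the ads bidding on exactly those terms."""
    index = defaultdict(list)
    for ad_id, phrase in bids:
        index[frozenset(phrase.lower().split())].append(ad_id)
    return index


def broad_match(index, query):
    """Return ads whose bid terms are all contained in the query."""
    terms = sorted(set(query.lower().split()))
    matches = []
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            # .get avoids inserting empty entries into the defaultdict
            matches.extend(index.get(frozenset(subset), []))
    return sorted(matches)


bids = [("a1", "cheap flights"), ("a2", "flights"), ("a3", "cheap hotels")]
index = build_bid_index(bids)
print(broad_match(index, "cheap flights to paris"))  # ['a1', 'a2']
```

Ad `a3` is not returned because its bid term "hotels" does not occur in the query, even though the query and the bid share the term "cheap".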

Knowledge and Information Systems | 2012

Improving clustering by learning a bi-stochastic data similarity matrix

Fei Wang; Ping Li; Arnd Christian König; Muting Wan

An idealized clustering algorithm seeks to learn a cluster-adjacency matrix such that, if two data points belong to the same cluster, the corresponding entry would be 1; otherwise, the entry would be 0. This integer (1/0) constraint makes it difficult to find the optimal solution. We propose a relaxation on the cluster-adjacency matrix, by deriving a bi-stochastic matrix from a data similarity (e.g., kernel) matrix according to the Bregman divergence. Our general method is named the Bregmanian Bi-Stochastication (BBS) algorithm. We focus on two popular choices of the Bregman divergence: the Euclidean distance and the Kullback–Leibler (KL) divergence. Interestingly, the BBS algorithm using the KL divergence is equivalent to the Sinkhorn–Knopp (SK) algorithm for deriving bi-stochastic matrices. We show that the BBS algorithm using the Euclidean distance is closely related to the relaxed k-means clustering and can often produce noticeably superior clustering results to the SK algorithm (and other algorithms such as Normalized Cut), through extensive experiments on public data sets.
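
For reference, the Sinkhorn–Knopp iteration mentioned above alternately rescales rows and columns until the matrix is (approximately) bi-stochastic. The sketch below is a minimal pure-Python version, assuming a square matrix with strictly positive entries and a fixed iteration count; the function name and the sample similarity matrix are illustrative and not taken from the paper.

```python
def sinkhorn_knopp(K, iters=200):
    """Alternately normalize rows and columns of a strictly positive square matrix."""
    n = len(K)
    P = [row[:] for row in K]  # work on a copy
    for _ in range(iters):
        # Row step: make every row sum to 1.
        for i in range(n):
            s = sum(P[i])
            P[i] = [v / s for v in P[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            s = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] /= s
    return P


# Hypothetical 3x3 similarity (e.g., kernel) matrix.
K = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
P = sinkhorn_knopp(K)
print([round(sum(row), 6) for row in P])
```

After enough iterations, both the row sums and the column sums of `P` are close to 1, which is the bi-stochastic property the relaxation relies on.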
