Tobias Berka | Researchain

Publication

Featured research published by Tobias Berka.

Archive | 2014

Dimensionality Reduction for Information Retrieval Using Vector Replacement of Rare Terms

Tobias Berka; Marián Vajteršic

Dimensionality reduction by algebraic methods is an established technique to address a number of problems in information retrieval. In this chapter, we introduce a new approach to dimensionality reduction for text retrieval. According to Zipf’s law, the majority of indexing terms occur only in a small number of documents. Our new algorithm exploits this observation to compute a dimensionality reduction. It replaces rare terms by computing a vector which expresses their semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an initial experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 8,742 to 500 and from 47,236 to 392 features, while preserving or even improving the retrieval performance.
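The idea of replacing rare terms by vectors over common terms can be illustrated with a minimal sketch. It is not the authors' algorithm: the selection of common terms by document frequency, the weighted-average replacement rule, and the names `build_projection` and `project` are all assumptions made for illustration, and term weights are assumed to be non-negative.

```python
def build_projection(corpus, k):
    """Build a k x m projection matrix from a list of document vectors.

    Assumption (not from the source): the k terms with the highest
    document frequency are kept as "common" terms, and every other
    ("rare") term is replaced by the weighted average of the common-term
    parts of the documents that contain it.
    """
    m = len(corpus[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for doc in corpus if doc[t] != 0) for t in range(m)]
    # Keep the k most frequent terms (ties broken by term index).
    common = sorted(sorted(range(m), key=lambda t: -df[t])[:k])
    pos = {t: i for i, t in enumerate(common)}

    proj = [[0.0] * m for _ in range(k)]
    for t in range(m):
        if t in pos:
            proj[pos[t]][t] = 1.0  # common terms map to themselves
            continue
        total = sum(doc[t] for doc in corpus)
        if total == 0:
            continue  # term never occurs, its column stays zero
        for doc in corpus:
            if doc[t] == 0:
                continue
            weight = doc[t] / total
            for c in common:
                proj[pos[c]][t] += weight * doc[c]
    return common, proj


def project(proj, vec):
    """Apply the projection matrix to a document or query vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in proj]


# Hypothetical toy corpus: 3 documents over 4 terms.
corpus = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]
common, proj = build_projection(corpus, k=2)
reduced_query = project(proj, [1, 0, 0, 1])
```

The same matrix is applied to corpus documents and to queries, which is what lets retrieval run entirely in the reduced space.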

Journal of Parallel and Distributed Computing | 2013

Parallel rare term vector replacement: Fast and effective dimensionality reduction for text

Tobias Berka; Marián Vajteršic

Dimensionality reduction is an established area in text mining and information retrieval. These methods convert the highly sparse corpus matrices into dense matrix format while preserving or improving the classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private memory parallel computer systems. According to Zipf's law, the majority of indexing terms occur only in a small number of documents. Our algorithm replaces rare terms by computing a vector which expresses their semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus, with a clear improvement in the retrieval performance. We have evaluated our parallel implementation using the message passing interface with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction for over 800,000 documents in just under 100 s.

Parallel Processing and Applied Mathematics | 2011

Portable explicit threading and concurrent programming for MPI applications

Tobias Berka; Helge Hagenauer; Marián Vajteršic

New applications for parallel computing in today's data centers, such as online analytical processing, data mining or information retrieval, require support for concurrency. Due to online query processing and multi-user operation, we need to concurrently maintain and analyze the data. While the Portable Operating System Interface (POSIX) defines a thread interface that is widely available, and while modern implementations of the Message Passing Interface (MPI) support threading, this combination is lacking in safety, security and reliability. The development of such parallel applications is therefore complex, difficult and error-prone. In response to this, we propose an additional layer of middleware for threaded MPI applications designed to simplify the development of concurrent parallel programs. We formulate a list of requirements and sketch a design rationale for such a library. Based on a prototype implementation, we measure the run-time overhead caused by the additional layer of indirection.

International Conference on Intelligent Computer Communication and Processing | 2010

Fast distributed image retrieval for e-science grids: Motivations and challenges

Tobias Berka; Rade Kutil; Marián Vajteršic

In research, grid computing is an established way of providing computer resources for information retrieval. However, e-science grids also contain, process and produce documents - thereby acting as digital libraries and requiring means for information discovery. In this paper, we discuss how distributed information retrieval can be integrated into the Open Grid Service Architecture (OGSA) to efficiently provide image retrieval for e-science grids. We identify two fundamental ways of performing information retrieval on the grid - as a batch job or as a distributed activity - and argue the case for the latter for reasons of efficiency. We give an analysis of the theoretical communication and computation complexity and demonstrate that bandwidth limitations provide a decisive argument to support our case. We describe further design decisions for our system architecture and give a brief comparison with other designs reported in the literature. Lastly, we describe how the statelessness and isolation of web services impede data-intensive, distributed, cross-site activities in OGSA grids, and how to escape these restrictions.

Procedia Computer Science | 2012

The Generalized Feed-forward Loop Motif: Definition, Detection and Statistical Significance

Tobias Berka

Network motifs play an important role in the qualitative analysis and quantitative characterization of networks. The feed-forward loop is a semantically important and statistically highly significant motif. In this paper, we extend the definition of the feed-forward loop to subgraphs of arbitrary size. To avoid the complexity of path enumeration, we define generalized feed-forward loops as pairs of source and target nodes that have two or more internally disjoint connecting paths. Based on this definition, we formally derive an approach for the detection of this generalized motif. Our quantitative analysis demonstrates that generalized feed-forward loops up to a certain path length are statistically significant. Loops of greater size are statistically underrepresented and hence an anti-motif.
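The definition above, a source and target pair joined by two or more internally disjoint paths, can be checked without enumerating paths. The following is a minimal sketch that swaps in a standard technique, unit-capacity maximum flow with node splitting, and is not the detection approach derived in the paper. The function names are hypothetical, and the path-length limit discussed in the abstract is ignored here.

```python
from collections import deque


def count_disjoint_paths(edges, s, t, limit=2):
    """Count internally vertex-disjoint directed paths from s to t.

    Stops counting at `limit`. Each node v is split into (v, "in") and
    (v, "out") joined by a unit-capacity arc, so at most one path can
    pass through any intermediate node.
    """
    cap = {}  # residual capacities: cap[u][v]

    def add_arc(u, v):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + 1
        cap[v].setdefault(u, 0)  # reverse arc for the residual graph

    edge_set = set(edges)  # ignore duplicate edges
    nodes = {n for e in edge_set for n in e} | {s, t}
    for n in nodes:
        add_arc((n, "in"), (n, "out"))
    for u, v in edge_set:
        add_arc((u, "out"), (v, "in"))

    source, sink = (s, "out"), (t, "in")
    found = 0
    while found < limit:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        # Push one unit of flow along the path that was found.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        found += 1
    return found


def generalized_ffl_pairs(edges):
    """Return all (source, target) pairs with at least two disjoint paths."""
    nodes = sorted({n for e in edges for n in e})
    return [(s, t) for s in nodes for t in nodes
            if s != t and count_disjoint_paths(edges, s, t) >= 2]


# The classic three-node feed-forward loop: a -> b -> c plus a -> c.
classic = [("a", "b"), ("b", "c"), ("a", "c")]
classic_pairs = generalized_ffl_pairs(classic)
```

A direct edge from source to target has no internal nodes, so it counts as one of the disjoint paths. This is why the classic three-node loop is the smallest instance of the generalized motif.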

International Conference on Supercomputing | 2011

SRC: information retrieval as a persistent parallel service on supercomputer infrastructure

Tobias Berka; Marián Vajteršic

We seek to create a parallel search engine which outperforms conventional, loosely coupled distributed systems. We have (1) parallelized the vector space model with 120%-180% parallel efficiency, (2) introduced a highly parallel algorithm for text dimensionality reduction that increases the search accuracy, measured by the mean average precision, by 4.8 percentage points on the Reuters corpus, and (3) developed a middleware for concurrent programming in parallel applications for index maintenance and multi-user operation. Using these building blocks, we present an overall system architecture that addresses the requirements of information retrieval as a persistently deployed parallel service.
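For readers unfamiliar with the metric, parallel efficiency is conventionally the speedup divided by the number of processes, so values above 100% indicate superlinear speedup. The sketch below uses that standard definition; the abstract does not state how the figures were measured, and the timings in the example are hypothetical.

```python
def parallel_efficiency(t_serial, t_parallel, processes):
    """Return parallel efficiency as a fraction (1.0 means 100%).

    Uses the conventional definition: speedup = t_serial / t_parallel,
    efficiency = speedup / processes.
    """
    if t_serial <= 0 or t_parallel <= 0 or processes <= 0:
        raise ValueError("timings and process count must be positive")
    speedup = t_serial / t_parallel
    return speedup / processes


# Hypothetical timings: 96 s on one process, 10 s on 8 processes.
efficiency = parallel_efficiency(96.0, 10.0, 8)
```

With these made-up numbers the speedup is 9.6 on 8 processes, an efficiency of 120%, which is the lower end of the range quoted above.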

International Conference on Parallel Processing | 2011

A Middleware for Concurrent Programming in MPI Applications

Tobias Berka; Helge Hagenauer; Marián Vajteršic

A wide range of computationally intensive applications such as information retrieval, on-line analytical processing and data mining inherently require concurrency, because concurrent data maintenance, query processing and multi-user operation are functional requirements. Therefore, concurrent programming is a prerequisite for such systems. However, existing tools for parallel programming fail to meet these demands for concurrency, and the adoption of parallel processing for these application domains is thus hindered. In this paper, we discuss the use of threads and concurrent programming constructs in state-of-the-art parallel programming tools and environments. We find that the necessary functionality is available, but often in an inconvenient and unreliable manner. Because the programmability and maintainability of parallel programs are a major concern, we consider the existing solutions inadequate or insufficient. We argue that an additional layer of middleware for threads and inter-thread communication and synchronization is necessary to support the effective development of persistently deployed parallel services for our targeted application domain, and we present the MPI Threads (MPIT) interface specification. We give several real-world examples to demonstrate its use and present performance benchmarks to illustrate the cost of the additional layer of indirection.

Computing and Informatics / Computers and Artificial Intelligence | 2012