Michal Batko
Masaryk University
Publication
Featured research published by Michal Batko.
Information Systems | 2011
David Novak; Michal Batko; Pavel Zezula
Metric space is a universal and versatile model of similarity that can be applied in various areas of information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering, thus reaching high search performance while having constant building costs per object. The heart of the M-Index is a general mapping mechanism that makes it possible to store the data in established structures such as the B+-tree or even in distributed storage. We implemented the M-Index with the B+-tree and performed experiments on two datasets: the first is an artificial set of vectors and the other is a real-life dataset composed of a combination of five MPEG-7 visual descriptors extracted from a database of up to several million digital images. The experiments put several M-Index variants under test and compare them with established techniques for both precise and approximate similarity search. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Further, the M-Index demonstrates an excellent ability to keep similar data close in the index, which makes its approximation algorithm very efficient: it maintains practically constant response times while preserving a very high recall as the dataset grows, even beating approaches designed purely for approximate search.
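The idea of mapping metric objects to scalar keys that an ordinary ordered structure can hold may be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-level scheme in which each object is keyed by the index of its closest pivot plus its normalized distance to that pivot, uses a sorted list as a stand-in for the B+-tree, and relies on a hypothetical bound `d_max` on all distances. Names such as `scalar_key` and `SortedKeyStore` are invented for illustration.

```python
import bisect
import math
from typing import List, Sequence

Vector = Sequence[float]
_EPS = 1e-9  # keeps the fractional part of a key strictly below 1


def euclidean(a: Vector, b: Vector) -> float:
    """Example metric; any function satisfying the metric axioms would do."""
    return math.dist(a, b)


def _frac(distance: float, d_max: float) -> float:
    """Normalize a distance into [0, 1), clamping at both ends."""
    return min(max(distance, 0.0) / d_max, 1.0 - _EPS)


def scalar_key(obj: Vector, pivots: List[Vector], d_max: float) -> float:
    """Key = index of the closest pivot + normalized distance to that pivot."""
    dists = [euclidean(obj, p) for p in pivots]
    i = min(range(len(pivots)), key=dists.__getitem__)
    return i + _frac(dists[i], d_max)


class SortedKeyStore:
    """Sorted list with interval lookup (hypothetical B+-tree stand-in)."""

    def __init__(self) -> None:
        self._keys: List[float] = []
        self._objs: List[Vector] = []

    def insert(self, key: float, obj: Vector) -> None:
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._objs.insert(pos, obj)

    def interval(self, lo: float, hi: float) -> List[Vector]:
        a = bisect.bisect_left(self._keys, lo)
        b = bisect.bisect_right(self._keys, hi)
        return self._objs[a:b]


def range_query(store: SortedKeyStore, pivots: List[Vector], d_max: float,
                q: Vector, r: float) -> List[Vector]:
    """Return all stored objects within distance r of q.

    By the triangle inequality, a qualifying object in the cluster of pivot p
    has d(o, p) in [d(q, p) - r, d(q, p) + r], so one key interval per
    cluster is scanned and the candidates are verified by the real distance.
    """
    result = []
    for i, p in enumerate(pivots):
        dq = euclidean(q, p)
        lo = i + _frac(dq - r, d_max)
        hi = i + _frac(dq + r, d_max)
        result.extend(o for o in store.interval(lo, hi) if euclidean(q, o) <= r)
    return result


if __name__ == "__main__":
    pivots = [(0.0, 0.0), (1.0, 1.0)]
    d_max = math.sqrt(2.0)  # assumes all data lie in the unit square
    store = SortedKeyStore()
    for obj in [(0.1, 0.2), (0.9, 0.8), (0.5, 0.4)]:
        store.insert(scalar_key(obj, pivots, d_max), obj)
    print(range_query(store, pivots, d_max, (0.2, 0.2), 0.15))
```

The sketch leaves out the multi-level partitioning and the additional filtering principles the abstract mentions; it only shows why an interval scan over scalar keys can answer a metric range query.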
ACM International Conference on Digital Libraries | 2007
Michal Batko; David Novak; Pavel Zezula
Similarity search has become a fundamental computational task in many applications. One of the mathematical models of similarity, the metric space, has drawn the attention of many researchers, resulting in several sophisticated metric-indexing techniques. An important part of research in this area is typically a prototype implementation and subsequent experimental evaluation of the proposed data structure. This paper describes an implementation framework called MESSIF that eases the task of building such prototypes. It provides a number of modules, ranging from basic storage management, through broad support for distributed processing, to automatic collection of performance statistics. Due to its open and modular design, it is also easy to implement additional modules if necessary. MESSIF also offers several ready-to-use generic clients for controlling and testing the index structures.
Scalable Information Systems | 2006
Michal Batko; David Novak; Fabrizio Falchi; Pavel Zezula
Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high, and the linear scalability of single-server implementations prevents efficient searching in large data volumes. In this paper, we briefly describe four recent scalable distributed similarity search techniques and study their performance in executing queries on three different datasets. Though all the methods employ parallelism to speed up query execution, the experiments identified different advantages for different objectives. The reported results can be exploited for choosing the best implementations for specific applications. They can also be used for designing new and better indexing structures in the future.
Similarity Search and Applications | 2009
David Novak; Michal Batko
Metric space, as a universal and versatile model of similarity, can be applied in various areas of non-text information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. We introduce a novel indexing and searching mechanism called Metric Index (M-Index) that employs practically all known principles of metric space partitioning, pruning, and filtering. The heart of the M-Index is a general mapping mechanism that makes it possible to store the data in well-established structures such as the B+-tree or even in distributed storage. We have implemented the M-Index with the B+-tree and performed experiments on a combination of five MPEG-7 descriptors in a database of hundreds of thousands of digital images. The experiments put several M-Index variants under test and compare them with two orthogonal approaches: the PM-Tree and the iDistance. The trials show that the M-Index outperforms the others in terms of efficiency of search-space pruning, I/O costs, and response times for precise similarity queries. Furthermore, the M-Index demonstrates an excellent ability to keep similar data close in the index, which makes its approximation algorithm very efficient: it maintains practically constant response times while preserving a very high recall as the dataset grows.
DELOS'04 Proceedings of the 6th Thematic conference on Peer-to-Peer, Grid, and Service-Orientation in Digital Library Architectures | 2004
Michal Batko; Claudio Gennaro; Pavel Zezula
Similarity search in metric spaces represents an important paradigm for content-based retrieval in many applications. Existing centralized search structures can speed up retrieval, but they do not scale up to large volumes of data because the response time increases linearly with the size of the searched file. The proposed GHT* index is a scalable and distributed structure. By exploiting parallelism in a dynamic network of computers, the GHT* achieves practically constant search time for similarity range queries in data sets of arbitrary size. The structure also scales well with respect to the growing volume of retrieved data. Moreover, the small amount of replicated routing information on each server grows logarithmically. At the same time, the potential for inter-query parallelism increases with growing data sets because the relative number of servers utilized by individual queries decreases. All these properties are verified by experiments on a prototype system using real-life data sets.
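The generalized hyperplane partitioning that underlies GHT-style structures can be sketched on a single machine as follows. This is only an illustration of the partitioning and pruning rule, not the GHT* protocol itself: the distribution of partitions across servers and the routing information are omitted, and the function names and the Euclidean metric are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Example metric (Euclidean); any metric distance function would work."""
    return math.dist(a, b)


def partition(objects: List[Point], p1: Point,
              p2: Point) -> Tuple[List[Point], List[Point]]:
    """Assign each object to the side of the pivot it is closer to."""
    left, right = [], []
    for o in objects:
        (left if dist(o, p1) <= dist(o, p2) else right).append(o)
    return left, right


def sides_to_visit(q: Point, r: float, p1: Point,
                   p2: Point) -> Tuple[bool, bool]:
    """Decide which partitions may hold objects within distance r of q.

    The triangle inequality shows that a left-side match requires
    d(q, p1) - r <= d(q, p2) + r, and a right-side match requires
    d(q, p1) + r >= d(q, p2) - r; a side failing its test is pruned.
    """
    d1, d2 = dist(q, p1), dist(q, p2)
    return d1 - r <= d2 + r, d1 + r >= d2 - r


def range_query(left: List[Point], right: List[Point], q: Point, r: float,
                p1: Point, p2: Point) -> List[Point]:
    """Scan only the partitions that survive pruning, then verify distances."""
    visit_left, visit_right = sides_to_visit(q, r, p1, p2)
    candidates = (left if visit_left else []) + (right if visit_right else [])
    return [o for o in candidates if dist(q, o) <= r]
```

In a distributed setting each partition could live on a different server, so a pruned side would correspond to a server that never receives the query.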
Theory and Practice of Digital Libraries | 2011
Michal Batko; Pavel Zezula
In all subfields of information retrieval, test datasets and ground truth data are important tools for testing and comparing new search methods. This is also reflected in the image retrieval community, where several benchmarking activities have been created in recent years. However, the number of available test collections is still rather small, and the existing ones are often limited in size or accessible only to the participants of benchmarking competitions. In this work, we present a new freely available large-scale dataset for the evaluation of content-based image retrieval systems. The dataset consists of 20 million high-quality images with five visual descriptors and rich and systematic textual annotations, a set of 100 test query objects, and semi-automatically collected ground truth data verified by users. Furthermore, we provide services that enable exploitation and collaborative expansion of the ground truth.
Databases, Information Systems, and Peer-to-Peer Computing | 2004
Michal Batko; Claudio Gennaro; Pavel Zezula
Similarity search in metric spaces represents an important paradigm for content-based retrieval in many applications. Existing centralized search structures can speed up retrieval, but they do not scale up to large volumes of data because the response time increases linearly with the size of the searched file. In this article, we study the problem of executing nearest neighbor(s) queries in a distributed metric structure, which is based on the P2P communication paradigm and generalized hyperplane partitioning. By exploiting parallelism in a dynamic network of computers, the query execution scales up very well considering both the number of distance computations and the hop count between the peers. Results are verified by experiments on real-life data sets.
Future Generation Computer Systems | 2008
Michal Batko; David Novak; Fabrizio Falchi; Pavel Zezula
Due to the increasing complexity of current digital data, similarity search has become a fundamental computational task in many applications. Unfortunately, its costs are still high and grow linearly on single-server structures, which prevents their efficient application to large data volumes. In this paper, we briefly describe four recent scalable distributed techniques for similarity search and study their performance in executing queries on three different datasets. Though all the methods employ parallelism to speed up query execution, the experiments identified different advantages for different objectives. The reported results would be helpful for choosing the best implementations for specific applications. They can also be used for designing new and better indexing structures in the future.
Information Processing and Management | 2012
David Novak; Michal Batko; Pavel Zezula
Metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient, and scalable solution for metric data management remains an open research challenge. In this work, we try to make an important step towards such a management system that would be able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in both a precise and an approximate manner. This technique can take advantage of any distributed hash table that supports interval queries and utilize it as an underlying index. We have performed numerous experiments to test various settings of the M-Index structure, and we have demonstrated its usability by developing a full-featured, publicly available Web application.
Similarity Search and Applications | 2009
Michal Batko; Petra Kohoutkova; David Novak
The Content-based Photo Image Retrieval (CoPhIR) dataset is the largest available database of digital images with corresponding visual descriptors. It contains five MPEG-7 global descriptors extracted from more than 106 million images from the Flickr photo-sharing system. In this paper, we analyze this dataset, focusing on 1) the efficiency of similarity-based indexing and searching and 2) the expressiveness of the combination of the descriptors with respect to the subjective perception of visual similarity. We treat the descriptors as metric spaces and then combine them into a multi-metric space. We analyze the distance distributions of individual descriptors, measure the intrinsic dimensionality of these datasets, and statistically evaluate the correlation between these descriptors. Further, we use two methods to assess the subjective accuracy and satisfaction of similarity retrieval based on a combination of descriptors that is recommended for CoPhIR, and we compare these results on databases of 10 and 100 million CoPhIR images. Finally, we suggest, explore, and evaluate two approaches to improve the accuracy: 1) applying logarithms in order to weaken the influence of a single descriptor's contribution if it deviates from the rest, and 2) categorizing the dataset and identifying visual characteristics important for individual categories.
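How a logarithm can weaken the influence of one deviating descriptor may be seen in a small sketch. It assumes that the multi-metric distance is a weighted sum of per-descriptor distances and that the logarithm is applied to each descriptor distance before weighting; the abstract does not specify either detail, and the descriptor names, weights, and distance values below are invented for illustration.

```python
import math
from typing import Dict


def weighted_sum(dists: Dict[str, float], weights: Dict[str, float]) -> float:
    """Plain multi-metric aggregation: weighted sum of descriptor distances."""
    return sum(weights[name] * dists[name] for name in weights)


def log_weighted_sum(dists: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """Same aggregation with each distance dampened by log(1 + d).

    log1p keeps a zero distance at zero and compresses large values, so a
    single outlying descriptor contributes less to the total.
    """
    return sum(weights[name] * math.log1p(dists[name]) for name in weights)


if __name__ == "__main__":
    # Hypothetical descriptor names and weights, not the CoPhIR settings.
    weights = {f"desc{i}": 1.0 for i in range(1, 6)}
    # Pair A: near-identical in four descriptors, far apart in one.
    pair_a = {"desc1": 12.0, "desc2": 0.1, "desc3": 0.1,
              "desc4": 0.1, "desc5": 0.1}
    # Pair B: moderately different in every descriptor.
    pair_b = {name: 2.0 for name in weights}
    print(weighted_sum(pair_a, weights), weighted_sum(pair_b, weights))
    print(log_weighted_sum(pair_a, weights), log_weighted_sum(pair_b, weights))
```

With these made-up values the plain sum ranks pair B as more similar, while the dampened sum ranks pair A as more similar, because the single outlying descriptor no longer dominates the total.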