Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Arturas Mazeika is active.

Publication


Featured research published by Arturas Mazeika.


ACM Transactions on Database Systems | 2007

Estimating the selectivity of approximate string queries

Arturas Mazeika; Michael H. Böhlen; Nick Koudas; Divesh Srivastava

Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
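The q-gram decomposition and inverse-string construction described above can be sketched in a few lines. The Python sketch below only illustrates those two steps under assumed input names; it is not the VSol estimator itself, since it omits signatures, clustering, and the actual selectivity estimate, and the final candidate count is just a crude illustration of how inverse strings are combined for a query.

```python
from collections import defaultdict

def qgrams(s, q=3):
    """Decompose a string into overlapping substrings of length q (q-grams)."""
    if len(s) < q:
        return [s]
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverse_strings(strings, q=3):
    """Associate each q-gram with its inverse string: the IDs of all
    database strings that contain that q-gram."""
    inverse = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            inverse[g].add(sid)
    return inverse

# Toy usage (illustrative only): collect the IDs of strings sharing at least
# one q-gram with the query -- a crude candidate set, not VSol's estimate.
db = ["jonathan", "jonatan", "johnathan", "maria"]
inv = build_inverse_strings(db)
candidates = set().union(*(inv.get(g, set()) for g in qgrams("jonthan")))
print(len(candidates))
```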


Very Large Data Bases | 2009

SHARC: framework for quality-conscious web archiving

Dimitar Denev; Arturas Mazeika; Marc Spaniol; Gerhard Weikum

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather sharp captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies towards better quality with given resources. We define quality measures, characterize their properties, and derive a suite of quality-conscious scheduling strategies for archive crawling. It is assumed that change rates of Web pages can be statistically predicted based on page types, directory depths, and URL names. We develop a stochastically optimal crawl algorithm for the offline case where all change rates are known. We generalize the approach into an online algorithm that detects information on a Web site while it is crawled. For dating a site capture and for assessing its quality, we propose several strategies that revisit pages after their initial downloads in a judiciously chosen order. All strategies are fully implemented in a testbed, and shown to be effective by experiments with both synthetically generated sites and a daily crawl series for a medium-sized site.


Conference on Information and Knowledge Management | 2011

Entity timelines: visual analytics and named entity evolution

Arturas Mazeika; Tomasz Tylenda; Gerhard Weikum

The constantly evolving Web reflects the evolution of society. Knowledge about entities (people, companies, political parties, etc.) evolves over time. Facts add up (e.g., awards, lawsuits, divorces), change (e.g., spouses, CEOs, political positions), and even cease to exist (e.g., countries split into smaller ones or join into bigger ones). Analytics of the evolution of entities poses many challenges, including extraction, disambiguation, and canonicalization of entities from large text collections, as well as the introduction of specific analysis and interactivity methods for the evolving entity data. In this demonstration proposal, we consider the novel problem of the evolution of named entities. To this end, we have extracted, disambiguated, canonicalized, and connected named entities with the YAGO ontology. To analyze the evolution we have developed a visual analytics system. Careful preprocessing and ranking of the ontological data allowed us to propose a wide range of effective interactions and data analysis techniques, including advanced filtering, contrasting timelines of entities, and drilling down into and rolling up the evolving data.


Visual Data Mining | 2008

Visual Data Mining: An Introduction and Overview

Simeon J. Simoff; Michael H. Böhlen; Arturas Mazeika

In our everyday life we interact with various information media, which present us with facts and opinions, supported with some evidence, based, usually, on condensed information extracted from data. It is common to communicate such condensed information in a visual form --- a static or animated, preferably interactive, visualisation. For example, when we watch familiar weather programs on the TV, landscapes with cloud, rain and sun icons and numbers next to them quickly allow us to build a picture about the predicted weather pattern in a region. Playing sequences of such visualisations will easily communicate the dynamics of the weather pattern, based on the large amount of data collected by many thousands of climate sensors and monitors scattered across the globe and on weather satellites. These pictures are fine when one watches the weather on Friday to plan what to do on Sunday --- after all, if the patterns are wrong there are always alternative ways of enjoying a holiday. Professional decision making would be a rather different scenario. It would require weather forecasts at a high level of granularity and precision, and in real time. Such requirements translate into requirements for high-volume data collection, processing, mining, modelling and communicating the models quickly to the decision makers. Further, the requirements translate into high-performance computing with integrated efficient interactive visualisation. From a practical point of view, if a weather pattern cannot be depicted fast enough, then it has no value. Recognising the power of the human visual perception system and pattern recognition skills adds another twist to the requirements --- data manipulations need to be completed at least an order of magnitude faster than real time in order to combine them with a variety of highly interactive visualisations, allowing easy remapping of data attributes to the features of the visual metaphor used to present the data. In these few steps through the weather domain, we have specified some requirements towards a visual data mining system.


Very Large Data Bases | 2011

The SHARC framework for data quality in Web archiving

Dimitar Denev; Arturas Mazeika; Marc Spaniol; Gerhard Weikum

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions. All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
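Coherence, as defined above, is a deterministic count of the pages that did not change during the capture. The following minimal sketch computes that count under assumed inputs (a per-page download log and per-page change timestamps); the data layout and the interval semantics are illustrative assumptions, not the paper's formalization.

```python
def coherence(download_times, change_times, capture_start, capture_end):
    """Count pages with no observed change inside the capture interval.

    download_times: dict page -> download time (kept to mirror a crawl log)
    change_times:   dict page -> list of times the page was observed to change
    """
    unchanged = 0
    for page in download_times:
        changes = change_times.get(page, [])
        if not any(capture_start <= t <= capture_end for t in changes):
            unchanged += 1
    return unchanged

# Toy example: three pages captured between t=10 and t=20; b.html changed at t=15.
downloads = {"a.html": 11, "b.html": 14, "c.html": 19}
changes = {"b.html": [15], "c.html": [3, 42]}
print(coherence(downloads, changes, 10, 20))  # -> 2 coherently captured pages
```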


International Database Engineering and Applications Symposium | 2006

Multi-dimensional Histograms with Tight Bounds for the Error

Linas Baltrunas; Arturas Mazeika; Michael H. Böhlen

Histograms are used as non-parametric selectivity estimators for one-dimensional data. For high-dimensional data it is common to either compute one-dimensional histograms for each attribute or to compute a multi-dimensional equi-width histogram for a set of attributes. This either yields small low-quality or large high-quality histograms. In this paper we introduce HIRED (high-dimensional histograms with dimensionality reduction): small high-quality histograms for multi-dimensional data. HIRED histograms are adaptive, and they are based on the shape error and directional splits. The shape error permits a precise control of the estimation error of the histogram and, together with directional splits, yields a memory complexity that does not depend on the number of uniform attributes in the dataset. We provide extensive experimental results with synthetic and real-world datasets. The experiments confirm that our method is as precise as state-of-the-art techniques and uses orders of magnitude less memory.
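For context, the one-dimensional equi-width histogram mentioned in the first sentence can be sketched as a selectivity estimator as follows. This is only the textbook baseline with a uniform-spread assumption inside each bucket; it is not HIRED, which additionally relies on the shape error and directional splits.

```python
def build_equiwidth_histogram(values, num_buckets):
    """Build a 1-D equi-width histogram as (lower bound, bucket width, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0   # guard against a constant column
    counts = [0] * num_buckets
    for v in values:
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return lo, width, counts

def estimate_selectivity(hist, lo_q, hi_q, n):
    """Estimate the fraction of values in [lo_q, hi_q], assuming values are
    spread uniformly inside each bucket."""
    lo, width, counts = hist
    est = 0.0
    for b, c in enumerate(counts):
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        overlap = max(0.0, min(hi_q, b_hi) - max(lo_q, b_lo))
        est += c * overlap / width
    return est / n

data = [1, 2, 2, 3, 8, 9, 9, 10, 15, 20]
hist = build_equiwidth_histogram(data, 4)
print(estimate_selectivity(hist, 0, 10, len(data)))  # ~0.76; the true fraction is 0.8
```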


Advances in Databases and Information Systems | 2008

Analysis and Interpretation of Visual Hierarchical Heavy Hitters of Binary Relations

Arturas Mazeika; Michael H. Böhlen; Daniel Trivellato

The emerging field of visual analytics changes the way we model, gather, and analyze data. Current data analysis approaches suggest gathering as much data as possible and then focusing on goal- and process-oriented data analysis techniques. Visual analytics changes this approach, and the methodology to interpret the results becomes the key issue. This paper contributes a method to interpret visual hierarchical heavy hitters (VHHHs). We show how to analyze data on the general level and how to examine specific areas of the data. We identify five common patterns that build the interpretation alphabet of VHHHs. We demonstrate our method on three different real-world datasets and show the effectiveness of our approach.


Lecture Notes in Computer Science | 2008

Using 2D Hierarchical Heavy Hitters to Investigate Binary Relationships

Daniel Trivellato; Arturas Mazeika; Michael H. Böhlen

This chapter presents VHHH: a visual data mining tool to compute and investigate hierarchical heavy hitters (HHHs) for two-dimensional data. VHHH computes the HHHs for a two-dimensional categorical dataset and a given threshold, and visualizes the HHHs in three-dimensional space. The chapter evaluates VHHH on synthetic and real-world data, provides an interpretation alphabet, and identifies common visualization patterns of HHHs.
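As a rough illustration of "the HHHs for a two-dimensional categorical dataset and a given threshold", the sketch below rolls each record up along both dimension hierarchies and reports every combination whose support exceeds the threshold. This is a simplified notion of hierarchical heavy hitters (it omits discounting counts already covered by descendant heavy hitters), it is not the VHHH tool or its 3-D visualization, and the toy hierarchies are invented for the example.

```python
from collections import Counter
from itertools import product

def hierarchical_heavy_hitters(records, hierarchies, threshold):
    """Simplified 2-D HHHs: count every roll-up of each (dim1, dim2) record
    along both hierarchies and keep combinations above threshold * |records|.
    (The classic definition additionally discounts descendant HHHs.)"""
    counts = Counter()
    for dim1, dim2 in records:
        # Each value rolls up to itself, its parent, and the root '*'.
        ups1 = list(dict.fromkeys([dim1, hierarchies[0].get(dim1, "*"), "*"]))
        ups2 = list(dict.fromkeys([dim2, hierarchies[1].get(dim2, "*"), "*"]))
        for combo in product(ups1, ups2):
            counts[combo] += 1
    n = len(records)
    return {combo: c for combo, c in counts.items() if c >= threshold * n}

# Toy data: (city, product) pairs with city -> country and product -> category.
records = [("Bolzano", "apples"), ("Bolzano", "pears"),
           ("Vilnius", "apples"), ("Vienna", "apples")]
hierarchies = ({"Bolzano": "Italy", "Vilnius": "Lithuania", "Vienna": "Austria"},
               {"apples": "fruit", "pears": "fruit"})
print(hierarchical_heavy_hitters(records, hierarchies, threshold=0.5))
```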


Advances in Databases and Information Systems | 2006

PPPA: push and pull pedigree analyzer for large and complex pedigree databases

Arturas Mazeika; Janis Petersons; Michael H. Böhlen

In this paper we introduce a novel push and pull technique to analyze pedigree data. We present the Push and Pull Pedigree Analyzer (PPPA) to organize large and complex pedigrees and investigate the development of genetic diseases. PPPA receives as input a pedigree (ancestry information) of different families. For each person the pedigree contains information about the occurrence of a specific genetic disease. We propose a new solution to arrange and visualize the individuals of the pedigree based on the relationships between individuals and information about the disease. PPPA starts with random positions of the individuals, and iteratively pushes apart non-relatives with opposite disease patterns and pulls together relatives with identical disease patterns. The goal is a visualization that groups families with homogeneous disease patterns. We investigate our solution experimentally with genetic data from people from South Tyrol, Italy. We show that the algorithm converges independently of the number of individuals n and the complexity of the relationships. The runtime of the algorithm is superlinear in n. The space complexity of the algorithm is linear in n. The visual analysis of the method confirms that our push and pull technique successfully deals with large and complex pedigrees.
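The push-and-pull iteration described above (random initial positions, pulling together relatives with identical disease patterns, pushing apart non-relatives with opposite patterns) can be sketched as a simple force-style loop. The step size, iteration count, and data layout are assumptions for illustration only, not PPPA's actual parameters or convergence criterion.

```python
import random

def push_pull_layout(individuals, related, iterations=200, step=0.05):
    """individuals: dict id -> disease pattern (e.g. a tuple of 0/1 flags)
    related:     set of frozenset({a, b}) pairs marking pedigree relationships."""
    pos = {i: (random.random(), random.random()) for i in individuals}
    ids = sorted(individuals)
    for _ in range(iterations):
        for idx, a in enumerate(ids):
            for b in ids[idx + 1:]:
                ax, ay = pos[a]
                bx, by = pos[b]
                dx, dy = bx - ax, by - ay
                same = individuals[a] == individuals[b]
                rel = frozenset((a, b)) in related
                if rel and same:            # pull relatives with identical patterns
                    f = step
                elif not rel and not same:  # push non-relatives with opposite patterns
                    f = -step
                else:
                    continue
                pos[a] = (ax + f * dx, ay + f * dy)
                pos[b] = (bx - f * dx, by - f * dy)
    return pos

# Toy pedigree: two unrelated families, one affected and one unaffected.
people = {"a1": (1,), "a2": (1,), "b1": (0,), "b2": (0,)}
kin = {frozenset(("a1", "a2")), frozenset(("b1", "b2"))}
print(push_pull_layout(people, kin))
```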


Database and Expert Systems Applications | 2009

Clustering of Short Strings in Large Databases

Michail Kazimianec; Arturas Mazeika

We propose CLOSS, a novel method for clustering short strings in textual databases. It successfully identifies clusters of misspelled strings even if the cluster border is not prominent. The method uses a q-gram approach to represent the data and a string proximity graph to find the clusters. The contribution addresses short-string clustering in text mining in cases where the proximity graph has multiple horizontal lines or no such line at all.

Collaboration


Dive into Arturas Mazeika's collaborations.

Top Co-Authors

Simeon J. Simoff | University of Western Sydney

Andrej Taliun | Free University of Bozen-Bolzano

Janis Petersons | Free University of Bozen-Bolzano

Daniel Trivellato | Free University of Bozen-Bolzano

Michail Kazimianec | Free University of Bozen-Bolzano