Publication


Featured research published by Lukasz Golab.


International Conference on Management of Data | 2003

Issues in data stream management

Lukasz Golab; M. Tamer Özsu

Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.


Very Large Data Bases | 2003

Processing sliding window multi-joins in continuous queries over data streams

Lukasz Golab; M. Tamer Özsu

We study sliding window multi-join processing in continuous queries over data streams. Several algorithms are reported for performing continuous, incremental joins, under the assumption that all the sliding windows fit in main memory. The algorithms include multiway incremental nested loop joins (NLJs) and multi-way incremental hash joins. We also propose join ordering heuristics to minimize the processing cost per unit time. We test a possible implementation of these algorithms and show that, as expected, hash joins are faster than NLJs for performing equi-joins, and that the overall processing cost is influenced by the strategies used to remove expired tuples from the sliding windows.
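To make the incremental hash-join idea concrete, here is a minimal Python sketch of a binary equi-join over two time-based sliding windows, in which each new tuple first triggers expiration and then probes the opposite window's hash index. All names and the window representation are illustrative assumptions; this is not the paper's implementation, which covers multi-way joins and join-ordering heuristics.

```python
from collections import defaultdict, deque

class WindowedHashJoin:
    """Sketch of an incremental, in-memory hash equi-join over two
    time-based sliding windows (illustrative, not the paper's code)."""

    def __init__(self, window_size):
        self.window_size = window_size          # window length in time units
        self.windows = [deque(), deque()]       # tuples as (timestamp, key, payload)
        self.indexes = [defaultdict(list), defaultdict(list)]  # key -> tuples

    def insert(self, stream_id, ts, key, payload):
        """Insert a new tuple, expire old tuples, and return join matches."""
        self._expire(ts)
        tup = (ts, key, payload)
        self.windows[stream_id].append(tup)
        self.indexes[stream_id][key].append(tup)
        other = 1 - stream_id                   # probe the other stream's index
        return [(tup, match) for match in self.indexes[other][key]]

    def _expire(self, now):
        """Remove tuples that have fallen out of either sliding window."""
        for sid in (0, 1):
            w = self.windows[sid]
            while w and w[0][0] <= now - self.window_size:
                old = w.popleft()
                self.indexes[sid][old[1]].remove(old)

# Usage: join two streams on 'key' over a 10-time-unit window.
j = WindowedHashJoin(window_size=10)
j.insert(0, ts=1, key="a", payload="x")
print(j.insert(1, ts=5, key="a", payload="y"))   # one match
print(j.insert(1, ts=20, key="a", payload="z"))  # stream 0's tuple has expired
```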


International Conference on Management of Data | 2009

Stream warehousing with DataDepot

Lukasz Golab; Theodore Johnson; J. Spencer Seidel; Vladislav Shkapenyuk

We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, transaction logs, and so on. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) with its emphasis on temporal data, best-effort consistency, and real-time response. However, as a data warehouse, DataDepot is designed to store tens to hundreds of terabytes of historical data, allow time windows measured in years or decades, and allow both real-time queries on recent data and deep analyses on historical data. In this paper we discuss the DataDepot architecture, with an emphasis on several of its novel and critical features. DataDepot is currently being used for five very large warehousing projects within AT&T; one of these warehouses ingests 500 Mbytes per minute (and is growing). We use these installations to illustrate streaming warehouse use and behavior, and design choices made in developing DataDepot. We conclude with a discussion of DataDepot applications and the efficacy of some optimizations.


International Conference on Data Engineering | 2009

Scheduling Updates in a Real-Time Stream Warehouse

Lukasz Golab; Theodore Johnson; Vladislav Shkapenyuk

This paper discusses updating a data warehouse that collects near-real-time data streams from a variety of external sources. The objective is to keep all the tables and materialized views up-to-date as new data arrive over time. We define the notion of data staleness, formalize the problem of scheduling updates in a way that minimizes average data staleness, and present scheduling algorithms designed to handle the complex environment of a real-time stream warehouse. A novel feature of our scheduling framework is that it considers the effect of an update on the staleness of the underlying tables rather than any property of the update job itself (such as deadline).


International Conference on Management of Data | 2005

Update-pattern-aware modeling and processing of continuous queries

Lukasz Golab; M. Tamer Özsu

A defining characteristic of continuous queries over on-line data streams, possibly bounded by sliding windows, is the potentially infinite and time-evolving nature of their inputs and outputs. New items continually arrive on the input streams and new results are continually produced. Additionally, inputs expire by falling out of range of their sliding windows and results expire when they cease to satisfy the query. This impacts continuous query processing in two ways. First, data stream systems allow tables to be queried alongside data streams, but in terms of query semantics, it is not clear how updates of tables are different from insertions and deletions caused by the movement of the sliding windows. Second, many interesting queries need to store state, which must be kept up-to-date as time goes on. Therefore, query processing efficiency depends highly on the amount of overhead involved in state maintenance. In this paper, we show that the above issues can be solved by understanding the update patterns of continuous queries and exploiting them during query processing. We propose a classification that defines four types of update characteristics. Using our classification, we present a definition of continuous query semantics that clearly states the role of relations. We then propose the notion of update-pattern-aware query processing, where physical implementations of query operators, including the data structures used for storing intermediate state, vary depending on the update patterns of their inputs and outputs. When tested on IP traffic logs, our update-pattern-aware query plans routinely outperform the existing techniques by an order of magnitude.
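The flavor of update-pattern awareness can be conveyed with a toy contrast between two state stores: if an input's tuples arrive and expire strictly in timestamp order, a queue gives constant-time expiration, whereas an input with arbitrary updates and deletions needs keyed storage. This is a hedged sketch of the general idea only, not the paper's four-class taxonomy or its operator implementations.

```python
from collections import deque

class FifoState:
    """State for inputs whose tuples arrive and expire in timestamp order
    (e.g., a time-based sliding window over a stream): O(1) expiration."""
    def __init__(self, window):
        self.window, self.q = window, deque()
    def insert(self, ts, tup):
        self.q.append((ts, tup))        # arrivals are already in ts order
    def expire(self, now):
        while self.q and self.q[0][0] <= now - self.window:
            self.q.popleft()            # oldest tuples leave from the front

class KeyedState:
    """State for inputs with arbitrary updates/deletions (e.g., a relation
    queried alongside the stream): per-key lookup, no ordering assumed."""
    def __init__(self):
        self.rows = {}
    def upsert(self, key, tup):
        self.rows[key] = tup            # an update overwrites in place
    def delete(self, key):
        self.rows.pop(key, None)        # deletions can target any key
```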


Synthesis Lectures on Data Management | 2010

Data Stream Management

Lukasz Golab; M. Tamer Özsu

Many applications process high volumes of streaming data, among them Internet traffic analysis, financial tickers, and transaction log mining. In general, a data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. In this lecture, we give an overview of recent research in stream processing, ranging from answering simple queries on high-speed streams to loading real-time data feeds into a streaming warehouse for off-line analysis. We will discuss two types of systems for end-to-end stream processing: Data Stream Management Systems (DSMSs) and Streaming Data Warehouses (SDWs). A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a DSMS evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory. In the first part of this lecture, we will discuss research problems in DSMSs, such as continuous query languages, non-blocking query operators that continually react to new data, and continuous query optimization. The second part covers SDWs, which combine the real-time response of a DSMS (by loading new data as soon as they arrive) with a data warehouse's ability to manage terabytes of historical data on secondary storage. Table of Contents: Introduction / Data Stream Management Systems / Streaming Data Warehouses / Conclusions


Very Large Data Bases | 2015

Profiling relational data: a survey

Ziawasch Abedjan; Lukasz Golab; Felix Naumann

Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
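The simpler single-column statistics mentioned in the abstract (null counts, distinct counts, frequent values) can be computed in a single pass. The following sketch is a toy illustration of such a profile, not any tool surveyed in the paper; the dictionary keys are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """One-pass single-column profile: null count, distinct count,
    and the most frequent values (a toy version of the 'simple' tasks)."""
    nulls, counts = 0, Counter()
    for v in values:
        if v is None:
            nulls += 1          # track missing values separately
        else:
            counts[v] += 1      # frequency of each non-null value
    return {
        "rows": len(values),
        "nulls": nulls,
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

print(profile_column(["a", "b", "a", None, "a"]))
# {'rows': 5, 'nulls': 1, 'distinct': 2, 'top_values': [('a', 3), ('b', 1)]}
```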


Extending Database Technology | 2004

On Indexing Sliding Windows over Online Data Streams

Lukasz Golab; Shaveen Garg; M. Tamer Özsu

We consider indexing sliding windows in main memory over on-line data streams. Our proposed data structures and query semantics are based on a division of the sliding window into sub-windows. By classifying windowed operators according to their method of execution, we motivate the need for two types of windowed indices: those which provide a list of attribute values and their counts for answering set-valued queries, and those which provide direct access to tuples for answering attribute-valued queries. We propose and evaluate indices for both of these cases and show that our techniques are more efficient than executing windowed queries without an index.
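A minimal sketch of the sub-window idea for the set-valued case: each sub-window keeps its own value counts, arrivals go into the newest sub-window, and expiration drops the oldest sub-window wholesale instead of deleting tuples one at a time. The structure and names below are illustrative assumptions, not the paper's design.

```python
from collections import Counter, deque

class SubWindowCountIndex:
    """Sketch of a windowed index over n sub-windows: per-sub-window value
    counts; expiration discards the oldest sub-window as a unit."""

    def __init__(self, num_subwindows):
        self.subs = deque([Counter() for _ in range(num_subwindows)],
                          maxlen=num_subwindows)

    def insert(self, value):
        self.subs[-1][value] += 1      # newest sub-window absorbs arrivals

    def advance(self):
        self.subs.append(Counter())    # maxlen drops the oldest sub-window

    def count(self, value):
        """Set-valued query: total count of a value across the window."""
        return sum(c[value] for c in self.subs)

idx = SubWindowCountIndex(num_subwindows=4)
for v in ["a", "b", "a"]:
    idx.insert(v)
idx.advance()                          # slide the window by one sub-window
idx.insert("a")
print(idx.count("a"))                  # 3: counts from both sub-windows
```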


International Conference on Management of Data | 2009

Estimating the confidence of conditional functional dependencies

Graham Cormode; Lukasz Golab; Flip Korn; Andrew McGregor; Divesh Srivastava; Xi Zhang

Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD) gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply when the pattern tableau is known in advance, and also the harder case when this is given after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative error guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.
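To make the support and confidence definitions concrete, here is a toy exact computation for a single-pattern tableau: confidence is the largest fraction of applicable rows that can be kept so the embedded FD holds, obtained by keeping the most frequent right-hand-side value in each left-hand-side group. The paper's contribution is estimating this in one or two small-space passes via sampling and sketching, which this sketch does not attempt; the schema and pattern syntax are hypothetical.

```python
from collections import Counter, defaultdict

def cfd_support_confidence(rows, lhs, rhs, pattern):
    """Exact support/confidence of a CFD (lhs -> rhs) restricted to rows
    matching `pattern` ('_' = wildcard). Toy exact version, not the
    paper's streaming estimators."""
    groups = defaultdict(Counter)
    support = 0
    for row in rows:
        if all(p == "_" or row[a] == p for a, p in pattern.items()):
            support += 1                               # row is applicable
            key = tuple(row[a] for a in lhs)
            groups[key][row[rhs]] += 1                 # tally RHS per LHS group
    if support == 0:
        return 0, 1.0
    kept = sum(max(c.values()) for c in groups.values())  # best keepable rows
    return support, kept / support

rows = [
    {"country": "US", "zip": "07974", "city": "Murray Hill"},
    {"country": "US", "zip": "07974", "city": "Murray Hill"},
    {"country": "US", "zip": "07974", "city": "Summit"},   # violating row
    {"country": "UK", "zip": "SW1",   "city": "London"},   # not applicable
]
# CFD: for country = 'US', zip -> city
print(cfd_support_confidence(rows, lhs=["zip"], rhs="city",
                             pattern={"country": "US"}))   # (3, 0.666...)
```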


IEEE Transactions on Knowledge and Data Engineering | 2012

Scalable Scheduling of Updates in Streaming Data Warehouses

Lukasz Golab; Theodore Johnson; Vladislav Shkapenyuk

We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of interarrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and whose objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r, its staleness is t minus r). We then propose a scheduling framework that handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, inability to preempt updates, heterogeneity of update jobs caused by different interarrival times and data volumes among different sources, and transient overload. A novel feature of our framework is that scheduling decisions do not depend on properties of update jobs (such as deadlines), but rather on the effect of update jobs on data staleness. Finally, we present a suite of update scheduling algorithms and extensive simulation experiments to map out factors which affect their performance.
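The staleness definition (t minus r) suggests a simple baseline policy: whenever the processor is free, refresh the table whose staleness is currently largest. The following single-processor simulation is a toy illustration of that greedy rule under the paper's staleness metric; it omits the framework's view hierarchies, priorities, consistency handling, and overload behavior, and all names and parameters are illustrative.

```python
def greedy_staleness_schedule(jobs, horizon):
    """Toy nonpreemptive single-processor scheduler over `jobs`
    (table -> load duration). A job started at time `now` loads data
    available up to `now`; staleness then grows linearly until the next
    refresh. Greedy rule: always refresh the stalest table."""
    freshness = {tbl: 0.0 for tbl in jobs}   # data loaded up to this time
    now, schedule = 0.0, []
    while now < horizon:
        tbl = max(jobs, key=lambda t: now - freshness[t])  # stalest table
        schedule.append((now, tbl, now - freshness[tbl]))  # staleness before
        freshness[tbl] = now          # refreshed with data available at start
        now += jobs[tbl]              # nonpreemptive: job runs to completion
    return schedule

# Three tables with different load times; watch the greedy rotation.
for start, tbl, stale in greedy_staleness_schedule(
        {"A": 1.0, "B": 2.0, "C": 4.0}, horizon=10.0):
    print(f"t={start:4.1f}  update {tbl}  staleness before = {stale:.1f}")
```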

Collaboration


Dive into Lukasz Golab's collaborations.

Top Co-Authors

Jaroslaw Szlichta

University of Ontario Institute of Technology

Pedro Casas

Austrian Institute of Technology

Barna Saha

University of Massachusetts Amherst

Arian Bär

University of Erlangen-Nuremberg