
Publication


Featured research published by Maik Thiele.


Data Warehousing and OLAP | 2007

Partition-based workload scheduling in living data warehouse environments

Maik Thiele; Ulrike Fischer; Wolfgang Lehner

The demand for so-called living or real-time data warehouses is increasing in many application areas such as manufacturing, event monitoring and telecommunications. In these fields, users usually expect short response times for their queries and high freshness for the requested data. However, meeting these fundamental requirements is challenging due to the high loads and the continuous flow of write-only updates and read-only queries, which may conflict with each other. Therefore, we present the concept of Workload Balancing by Election (WINE), which allows users to express their individual demands on the Quality of Service and the Quality of Data, respectively. WINE applies this information to balance and prioritize both types of transactions, queries and updates, according to the varying user needs. A simulation study shows that our proposed algorithm outperforms baseline algorithms over the entire spectrum of workloads and user requirements.
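
The election idea can be illustrated with a minimal scheduler sketch. This is a hypothetical simplification: `pick_next` and the single `qos_weight` knob are illustrative names, not WINE's actual interface.

```python
# Minimal sketch of QoS/QoD-driven transaction scheduling, loosely
# modelled on the WINE idea (hypothetical simplification).

def pick_next(queries, updates, qos_weight):
    """Pick the next transaction to execute.

    queries, updates: lists of pending transaction ids.
    qos_weight: 0.0..1.0 -- user preference for Quality of Service
    (fast queries) over Quality of Data (fresh data via updates).
    """
    if not queries:
        return updates.pop(0) if updates else None
    if not updates:
        return queries.pop(0)
    # Favour queries when QoS dominates, updates when QoD dominates.
    if qos_weight >= 0.5:
        return queries.pop(0)
    return updates.pop(0)
```

A query-latency-sensitive user (high `qos_weight`) sees queries drained first; a freshness-sensitive user sees updates applied first.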


International Journal of Data Warehousing and Mining | 2013

On-Demand ELT Architecture for Right-Time BI: Extending the Vision

Florian Waas; Robert Wrembel; Tobias Freudenreich; Maik Thiele; Christian Koncilia; Pedro Furtado

In a typical BI infrastructure, data extracted from operational data sources is transformed, cleansed, and loaded into a data warehouse by a periodic ETL process, typically executed on a nightly basis, i.e., a full day's worth of data is processed and loaded during off-hours. However, it is desirable to have fresher data for business insights in near real-time. To this end, the authors propose to leverage a data warehouse's capability to directly import raw, unprocessed records and defer the transformation and data cleaning until needed by pending reports. At that time, the database's own processing mechanisms can be deployed to process the data on demand. Event-processing capabilities are seamlessly woven into the proposed architecture. Besides outlining an overall architecture, the authors also develop a roadmap for implementing a complete prototype using conventional database technology in the form of hierarchical materialized views.
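
The deferred-transformation idea can be sketched in a few lines. This is a toy illustration, not the paper's architecture: `OnDemandStore` and its lazy cache stand in for raw-record import plus on-demand materialization.

```python
# Toy sketch of on-demand ELT: raw records are loaded as-is (cheap),
# and cleansing runs lazily the first time a report reads the data,
# mimicking a materialized view refreshed on demand (hypothetical
# simplification of the proposed architecture).

class OnDemandStore:
    def __init__(self, transform):
        self.raw = []
        self.clean = None          # cached transformed result
        self.transform = transform

    def load(self, record):
        self.raw.append(record)    # no ETL work at load time
        self.clean = None          # invalidate the cached view

    def query(self):
        if self.clean is None:     # transform lazily, then cache
            self.clean = [self.transform(r) for r in self.raw]
        return self.clean
```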


Very Large Data Bases | 2012

DrillBeyond: enabling business analysts to explore the web of open data

Julian Eberius; Maik Thiele; Katrin Braunschweig; Wolfgang Lehner

Following the Open Data trend, governments and public agencies have started making their data available on the Web and have established platforms such as data.gov or data.un.org. These Open Data platforms provide a huge amount of data on various topics such as demographics, transport, finance or health, in various data formats. One typical usage scenario for this kind of data is its integration into a database or data warehouse in order to apply data analytics. However, in today's business intelligence tools there is an evident lack of support for so-called situational or ad-hoc data integration. In this demonstration we therefore present DrillBeyond, a novel database and information retrieval engine which allows users to query a local database as well as the Web of Open Data in a seamless and integrated way with standard SQL. The audience will be able to pose queries to our DrillBeyond system, which will be answered partly from local data in the database and partly from datasets that originate from the Web of Data. We will show how such queries are divided into known and unknown parts and how missing attributes are mapped to open datasets. We will also demonstrate the integration of the open datasets back into the DBMS in order to apply its analytical features.
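
The split into known and unknown query parts can be sketched simply. The function name and inputs below are illustrative assumptions, not DrillBeyond's actual API.

```python
# Hypothetical sketch: partition a query's requested attributes into
# those covered by the local schema and those that must be resolved
# against open datasets on the Web.

def split_attributes(requested, local_schema):
    known = [a for a in requested if a in local_schema]
    unknown = [a for a in requested if a not in local_schema]
    return known, unknown
```

The `known` part would be answered by the local database and each `unknown` attribute mapped to a candidate open dataset.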


Business Intelligence for the Real-Time Enterprise | 2009

Evaluation of Load Scheduling Strategies for Real-Time Data Warehouse Environments

Maik Thiele; Wolfgang Lehner

The demand for so-called living or real-time data warehouses is increasing in many application areas, including manufacturing, event monitoring and telecommunications. In fields like these, users normally expect short response times for their queries and high freshness for the requested data. However, it is truly challenging to meet both requirements at the same time because of the continuous flow of write-only updates and read-only queries as well as the latency caused by arbitrarily complex ETL processes. To optimize the update flow in terms of data freshness maximization and load minimization, we propose two algorithms, local and global scheduling, that operate on the basis of different system information. We discuss the benefits and drawbacks of both approaches in detail and derive recommendations regarding the optimal scheduling strategy for any given system setup and workload.
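
The contrast between the two strategies can be sketched as follows. The `staleness`/`query_rate` inputs and the weighting are illustrative assumptions, not the paper's actual cost model.

```python
# Hypothetical contrast: a local scheduler orders update batches only
# by how stale each target table is, while a global scheduler also
# weighs how heavily each table is queried system-wide.

def schedule(tables, use_global=False):
    """tables: dict name -> {"staleness": float, "query_rate": float}.
    Returns table names in the order their updates should be applied."""
    if use_global:
        key = lambda t: tables[t]["staleness"] * tables[t]["query_rate"]
    else:
        key = lambda t: tables[t]["staleness"]
    return sorted(tables, key=key, reverse=True)
```

A rarely queried but very stale table ranks first locally, yet can be deferred globally in favour of a fresher table that many queries touch.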


Statistical and Scientific Database Management | 2015

Top-k entity augmentation using consistent set covering

Julian Eberius; Maik Thiele; Katrin Braunschweig; Wolfgang Lehner

Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State-of-the-art methods return a single result that, to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries on large corpora of heterogeneous sources using information retrieval and automatic schema matching methods cannot easily return a single result that the user can trust, especially if the result is composed from a large number of sources that the user has to verify manually. We therefore propose to process these queries in a top-k fashion, in which the system produces multiple minimal consistent solutions from which the user can choose, to resolve the uncertainty of the data sources and methods used. In this paper, we introduce and formalize the problem of consistent, multi-solution set covering, and present algorithms based on a greedy and a genetic optimization approach. We then apply these algorithms to Web table-based entity augmentation. The publication further includes a Web table corpus with 100M tables, and a Web table retrieval and matching system in which these algorithms are implemented. Our experiments show that the consistency and minimality of the augmentation results can be improved using our set covering approach, without loss of precision or coverage and while producing multiple alternative query results.
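
The greedy variant builds on classic greedy set cover, which can be sketched in a few lines. This is a plain single-solution simplification; the paper's algorithm additionally enforces consistency and produces k alternative covers.

```python
# Plain greedy set cover: repeatedly pick the source that covers the
# most still-uncovered queried entities (hypothetical simplification
# of the paper's consistent multi-solution covering).

def greedy_cover(entities, sources):
    """entities: set of queried entities; sources: dict name -> set of
    entities the source provides values for. Returns the chosen source
    names forming one (approximately minimal) cover."""
    uncovered = set(entities)
    chosen = []
    while uncovered:
        best = max(sources, key=lambda s: len(sources[s] & uncovered))
        gained = sources[best] & uncovered
        if not gained:  # remaining entities cannot be covered at all
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```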


Data Warehousing and OLAP | 2014

A Framework for User-Centered Declarative ETL

Vasileios Theodorou; Alberto Abelló; Maik Thiele; Wolfgang Lehner

As business requirements evolve with increasing information density and velocity, there is a growing need for efficiency and automation of Extract-Transform-Load (ETL) processes. Current approaches for the modeling and optimization of ETL processes provide platform-independent optimization solutions for the (semi-)automated transition among different abstraction levels, focusing on cost and performance. However, the suggested representations are not abstract enough to communicate business requirements, and the role of process quality in a user-centered perspective has not yet been adequately examined. In this paper, we introduce a novel methodology for the end-to-end design of ETL processes that takes into consideration both functional and non-functional requirements. Based on existing work, we raise the level of abstraction for the conceptual representation of ETL operations and we show how process quality characteristics can generate specific patterns on the process design.


2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC) | 2015

Towards a Hybrid Imputation Approach Using Web Tables

Ahmad Ahmadov; Maik Thiele; Julian Eberius; Wolfgang Lehner; Robert Wrembel

Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With new emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further reinforced. While traditionally the process of filling in missing values is addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on that, chooses the best imputation approach, i.e., a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus with 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system which is robust, accurate, and has high coverage at the same time.
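
The hybrid decision can be sketched as a simple rule. Both the inputs and the 0.7 threshold below are illustrative assumptions, not the paper's tuned criteria.

```python
# Hypothetical sketch of the hybrid choice: prefer a statistical model
# when the incomplete attribute correlates strongly with a complete
# one, fall back to an external (Web-table) lookup otherwise.

def choose_imputation(correlation, lookup_coverage):
    """correlation: |corr| between the incomplete attribute and its
    best complete predictor; lookup_coverage: fraction of missing
    entities found in external sources."""
    if correlation >= 0.7:       # illustrative threshold
        return "regression"
    if lookup_coverage > 0.0:
        return "web_lookup"
    return "regression"          # last resort: statistical estimate
```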


International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management | 2016

A Machine Learning Approach for Layout Inference in Spreadsheets

Elvis Koci; Maik Thiele; Oscar Romero; Wolfgang Lehner

Spreadsheet applications are one of the most used tools for content generation and presentation in industry and the Web. In spite of this success, there does not exist a comprehensive approach to automatically extract and reuse the richness of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.
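
Cell-level features of the kind such a classifier might consume can be sketched as follows. These four features are illustrative examples only; the paper's actual feature set is much richer.

```python
# Illustrative cell-level feature extraction for layout classification
# (hypothetical simplification of the paper's feature set).

def cell_features(value, is_bold=False):
    text = str(value)
    return {
        # numeric content often indicates a data cell
        "is_numeric": text.replace(".", "", 1).lstrip("-").isdigit(),
        "length": len(text),
        # bold or all-caps text often indicates a header/label cell
        "is_bold": is_bold,
        "all_caps": text.isalpha() and text.isupper(),
    }
```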


2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC) | 2016

Building the Dresden Web Table Corpus: A Classification Approach

Julian Eberius; Katrin Braunschweig; Markus Hentsch; Maik Thiele; Ahmad Ahmadov; Wolfgang Lehner

In recent years, researchers have recognized relational tables on the Web as an important source of information. To assist this research we developed the Dresden Web Tables Corpus (DWTC), a collection of about 125 million data tables extracted from the Common Crawl (CC) which contains 3.6 billion web pages and is 266TB in size. As the vast majority of HTML tables are used for layout purposes and only a small share contains genuine tables with different surface forms, accurate table detection is essential for building a large-scale Web table corpus. Furthermore, correctly recognizing the table structure (e.g. horizontal listings, matrices) is important in order to understand the role of each table cell, distinguishing between label and data cells. In this paper, we present an extensive table layout classification that enables us to identify the main layout categories of Web tables with very high precision. We therefore identify and develop a plethora of table features, different feature selection techniques and several classification algorithms. We evaluate the effectiveness of the selected features and compare the performance of various state-of-the-art classification algorithms. Finally, the winning approach is employed to classify millions of tables resulting in the Dresden Web Table Corpus (DWTC).
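
A toy heuristic conveys the flavour of the genuine-vs-layout decision. This is an assumption-laden simplification; the corpus was built with far richer features and trained classifiers, not a single rule.

```python
# Toy heuristic for table detection: layout tables are often single
# rows/columns or ragged grids, while genuine relational tables tend
# to be regular grids wider than one column (hypothetical rule).

def looks_genuine(rows):
    """rows: list of lists of cell strings."""
    if len(rows) < 2:
        return False
    widths = {len(r) for r in rows}
    return len(widths) == 1 and widths.pop() > 1
```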


Statistical and Scientific Database Management | 2015

DrillBeyond: processing multi-result open world SQL queries

Julian Eberius; Maik Thiele; Katrin Braunschweig; Wolfgang Lehner

In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this paper, we present DrillBeyond, a novel IR/RDBMS hybrid system in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over the relational database, but additionally allows the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus mixing properties of RDBMS and IR systems. We design a novel plan operator that encapsulates a web data retrieval and matching system and allows direct integration of such systems into relational query processing. We then present methods for efficiently processing multiple variants of a query by producing plans that are optimized for large invariant intermediate results that can be reused between multiple query evaluations. We demonstrate the viability of the operator and our optimization strategies by implementing them in PostgreSQL and evaluating them on a standard benchmark with arbitrary attributes added to its queries.
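
The reuse idea behind the multi-variant evaluation can be sketched simply. The function and its inputs are illustrative assumptions, not the system's actual plan operator.

```python
# Hypothetical sketch: evaluate the invariant part of the query once,
# then join each candidate web source against the cached result to
# produce one answer variant per source (top-k style).

def evaluate_variants(local_rows, candidate_sources, key):
    """local_rows: cached invariant intermediate result (list of dicts);
    candidate_sources: list of dicts mapping key value -> augmented
    attribute value. Returns one result set per candidate source."""
    results = []
    for source in candidate_sources:
        augmented = [dict(row, extra=source.get(row[key]))
                     for row in local_rows]
        results.append(augmented)
    return results
```

The expensive local computation (`local_rows`) is performed once and shared across all k evaluations, which is the point of the invariant-result optimization.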

Collaboration


Dive into Maik Thiele's collaborations.

Top Co-Authors

Wolfgang Lehner, Dresden University of Technology
Julian Eberius, Dresden University of Technology
Katrin Braunschweig, Dresden University of Technology
Elena Vasilyeva, Dresden University of Technology
Alberto Abelló, Polytechnic University of Catalonia
Oscar Romero, Polytechnic University of Catalonia
Vasileios Theodorou, Polytechnic University of Catalonia
Christof Bornhövd, Technische Universität Darmstadt
Dirk Habich, Dresden University of Technology
Elvis Koci, Dresden University of Technology