Hannes Mühleisen
Centrum Wiskunde & Informatica
Publications
Featured research published by Hannes Mühleisen.
International Semantic Web Conference | 2013
Kai Eckert; Robert Meusel; Hannes Mühleisen; Michael Schuhmacher; Johanna Völker
More and more websites embed structured data describing, for instance, products, reviews, blog posts, people, organizations, events, and cooking recipes into their HTML pages using markup standards such as Microformats, Microdata, and RDFa. This development has accelerated in the last two years as major Web companies, such as Google, Facebook, Yahoo!, and Microsoft, have started to use the embedded data within their applications. In this paper, we analyze the adoption of RDFa, Microdata, and Microformats across the Web. Our study is based on a large public Web crawl dating from early 2012 and consisting of 3 billion HTML pages that originate from over 40 million websites. The analysis reveals the deployment of the different markup standards, the main topical areas of the published data, and the different vocabularies that are used within each topical area to represent data. What distinguishes our work from earlier studies, published by the large Web companies, is that the analyzed crawl as well as the extracted data are publicly available. This allows our findings to be verified and to be used as starting points for further domain-specific investigations as well as for focused information extraction endeavors.
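To make the three markup standards concrete, here is a toy sketch (in Python, with BeautifulSoup) of how one might detect which standards a single HTML page uses. This is an illustration only, not the extraction pipeline behind the study, and the class names checked for Microformats are just a few common examples.

```python
# Illustrative only -- not the study's extraction pipeline. Detects which of
# the three markup standards appear in one HTML page.
from bs4 import BeautifulSoup

def markup_standards(html):
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    if soup.find(attrs={"itemscope": True}):     # Microdata item declarations
        found.add("Microdata")
    if soup.find(attrs={"typeof": True}) or soup.find(attrs={"property": True}):
        found.add("RDFa")                        # RDFa typing/property attributes
    if soup.find(class_=["vcard", "hreview", "hrecipe"]):
        found.add("Microformats")                # a few classic Microformat classes
    return found

page = '<div itemscope itemtype="https://schema.org/Product"><span itemprop="name">X</span></div>'
print(markup_standards(page))  # {'Microdata'}
```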
Data Management on New Hardware | 2013
Hannes Mühleisen; Romulo Goncalves; Martin L. Kersten
Many database systems share a need for large amounts of fast storage. However, economies of scale limit the utility of extending a single machine with an arbitrary amount of memory. The recent broad availability of the zero-copy data transfer protocol RDMA over low-latency and high-throughput network connections such as InfiniBand prompts us to revisit the long-proposed usage of memory provided by remote machines. In this paper, we present a solution to make use of remote memory without manipulation of the operating system, and investigate the impact on database performance.
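As a conceptual aid only: the general pattern of transparently extending local memory with a remote tier can be sketched as a buffer pool that evicts cold pages to a remote store. The dict below merely stands in for another machine's memory; the paper's actual mechanism moves data over RDMA/InfiniBand without operating-system changes.

```python
# Conceptual toy only: a bounded local cache that spills cold pages to
# "remote" memory and fetches them back on demand.
from collections import OrderedDict

class TieredBufferPool:
    def __init__(self, local_capacity):
        self.local = OrderedDict()   # page_id -> bytes, kept in LRU order
        self.remote = {}             # stand-in for a remote machine's memory
        self.capacity = local_capacity

    def put(self, page_id, data):
        self.local[page_id] = data
        self.local.move_to_end(page_id)
        while len(self.local) > self.capacity:
            victim, payload = self.local.popitem(last=False)  # evict LRU page
            self.remote[victim] = payload                     # "send" over the wire

    def get(self, page_id):
        if page_id not in self.local:        # miss: pull the page back from remote
            self.put(page_id, self.remote.pop(page_id))
        self.local.move_to_end(page_id)
        return self.local[page_id]

pool = TieredBufferPool(local_capacity=2)
for i in range(4):
    pool.put(i, b"page-%d" % i)
print(sorted(pool.remote))  # pages 0 and 1 were spilled to the remote tier
```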
Statistical and Scientific Database Management | 2013
Hannes Mühleisen; Thomas Lumley
Statistics software packages and relational database systems overlap considerably in the areas of data loading, handling, and transformation. However, of the two, only databases are optimized for high performance in this area. In this paper, we present our approach to bringing the best of these two worlds together: we integrate the analytics-optimized database MonetDB and the R environment for statistical computing in an unobtrusive, transparent, and compatible way.
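The connector the paper describes targets R; as a language-neutral illustration of the same division of labor, the Python sketch below (using the pymonetdb client) pushes the heavy aggregation into MonetDB and pulls only the compact result into the analysis environment. The table, columns, and credentials are hypothetical.

```python
# Sketch of the division of labor: the database does the heavy data handling,
# and only the compact aggregate crosses into the statistics environment.
# Assumes pymonetdb and a running MonetDB server; names are made up.
import pymonetdb

conn = pymonetdb.connect(database="demo", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()
# The aggregation runs inside MonetDB; one row per group crosses the wire.
cur.execute("""
    SELECT sensor_id, AVG(value) AS mean_value, COUNT(*) AS n
    FROM measurements
    WHERE value IS NOT NULL
    GROUP BY sensor_id
""")
for sensor_id, mean_value, n in cur.fetchall():
    print(sensor_id, mean_value, n)
conn.close()
```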
Statistical and Scientific Database Management | 2016
Mark Raasveldt; Hannes Mühleisen
Data scientists rely on vector-based scripting languages such as R, Python, and MATLAB to perform ad-hoc data analysis on potentially large data sets. These languages are only efficient, however, when data is processed using vectorized or bulk operations. At the same time, the overwhelming volume and variety of data, together with parsing overhead, suggest that the use of specialized analytical data management systems would be beneficial; the data might also already be stored in a database. Executing data analysis programs, such as data mining, directly inside a database greatly improves analysis efficiency. We investigate how these vector-based languages can be efficiently integrated into the processing model of operator-at-a-time databases. We present MonetDB/Python, a new system that combines the open-source database MonetDB with the vector-based language Python. In our evaluation, we demonstrate efficiency gains of orders of magnitude.
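A sketch of what using MonetDB/Python can look like, assuming a MonetDB server built with embedded Python enabled; the exact UDF syntax may vary across versions, and the table and function names are made up. The key point is that the UDF body receives whole columns as NumPy arrays, so the work is one bulk operation instead of millions of per-row interpreter calls.

```python
# Sketch of a vectorized Python UDF registered inside MonetDB. Inside the UDF
# body, the input column arrives as a NumPy array, so '(v - v.mean())' is a
# single bulk operation over the whole column. Assumes embedded Python is
# enabled on the server; names and syntax details are illustrative.
import pymonetdb

conn = pymonetdb.connect(database="demo", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()
cur.execute("""
    CREATE FUNCTION zscore(v DOUBLE) RETURNS DOUBLE
    LANGUAGE PYTHON {
        # v is a NumPy array holding the entire input column at once
        return (v - v.mean()) / v.std()
    };
""")
cur.execute("SELECT zscore(value) FROM measurements")  # hypothetical table
```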
Very Large Data Bases | 2017
Mark Raasveldt; Hannes Mühleisen
Transferring a large amount of data from a database to a client program is a surprisingly expensive operation. The time this requires can easily dominate the query execution time for large result sets. This represents a significant hurdle for external data analysis, for example when using statistical software. In this paper, we explore and analyse the result set serialization design space. We present experimental results from a large chunk of the database market and show the inefficiencies of current approaches. We then propose a columnar serialization method that improves transmission performance by an order of magnitude.
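A toy illustration of the intuition (not the serialization format proposed in the paper): packing a result set column-by-column turns per-row, per-value packing work into a handful of contiguous bulk operations, while the payload size stays identical.

```python
# Toy comparison of row-wise vs. columnar result-set serialization.
import struct

rows = [(i, float(i) * 0.5) for i in range(1000)]

# Row-wise: one struct.pack call per row (int32 + float64 each time).
row_wire = b"".join(struct.pack("<id", k, v) for k, v in rows)

# Columnar: transpose once, then each column is a single contiguous pack.
ids, vals = zip(*rows)
col_wire = struct.pack(f"<{len(ids)}i", *ids) + struct.pack(f"<{len(vals)}d", *vals)

assert len(row_wire) == len(col_wire)  # same payload bytes, far fewer calls
```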
Publications of the Astronomical Society of the Pacific | 2016
Meng Wan; Chao Wu; Jing Wang; Y.-L. Qiu; L. P. Xin; Sjoerd Mullender; Hannes Mühleisen; Bart Scheers; Ying Zhang; Niels Nes; Martin L. Kersten; Yongpan Huang; J. S. Deng; Jian-Yan Wei
The ground-based wide-angle camera array (GWAC), a part of the SVOM space mission, will search for various types of optical transients by continuously imaging a field of view (FOV) of 5000 degrees² every 15 s. Each exposure consists of 36 × 4k × 4k pixels, typically resulting in 36 × ~175,600 extracted sources. For a modern time-domain astronomy project like GWAC, which produces massive amounts of data at a high cadence, it is challenging to search for short-timescale transients in both real-time and archived data, and to build long-term light curves for variable sources. Here, we develop a high-cadence, high-density light curve pipeline (HCHDLP) to process the GWAC data in real time, and design a distributed shared-nothing database to manage the massive amount of archived data, which will be used to generate a source catalog with more than 100 billion records during 10 years of operation. First, we develop HCHDLP on top of the column-store DBMS MonetDB, taking advantage of MonetDB's high performance on massive data processing. To achieve real-time operation, we optimize the source-association function of the pipeline, reducing both time and space complexity from outside the database (at the SQL level) and inside it (in the RANGE-JOIN implementation), and refine its strategy for building complex light curves. The optimized source-association function is accelerated by three orders of magnitude. Second, we build a distributed database using a two-level time-partitioning strategy via the MERGE TABLE and REMOTE TABLE facilities of MonetDB. Intensive tests validate that our database architecture achieves both linear scalability in response time and concurrent access by multiple users. In summary, our studies provide guidance for GWAC's real-time data processing and massive data management.
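The two-level partitioning can be sketched with MonetDB's MERGE TABLE and REMOTE TABLE features, here issued from Python via pymonetdb. Node addresses, table names, and the schema are illustrative; check the MonetDB documentation for the exact syntax of your version.

```python
# Sketch of two-level time partitioning: a merge table on the head node
# unions monthly partitions, each living on a worker node as a remote table.
# All names, addresses, and the schema are illustrative.
import pymonetdb

conn = pymonetdb.connect(database="gwac", hostname="head-node",
                         username="monetdb", password="monetdb")
cur = conn.cursor()

# Level 1: the merge table that queries will target.
cur.execute("CREATE MERGE TABLE lightcurve (source_id BIGINT, ts TIMESTAMP, mag REAL)")

# Level 2: a monthly partition on a worker, referenced as a remote table.
cur.execute("""
    CREATE REMOTE TABLE lightcurve_2016_01 (source_id BIGINT, ts TIMESTAMP, mag REAL)
    ON 'mapi:monetdb://worker1:50000/gwac'
""")
cur.execute("ALTER TABLE lightcurve ADD TABLE lightcurve_2016_01")

# Queries against the merge table now fan out across the partitions.
cur.execute("SELECT COUNT(*) FROM lightcurve WHERE source_id = 42")
```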
International Conference on e-Science | 2015
Romulo Goncalves; Milena Ivanova; Foteini Alvanaki; J. Maassen; Konstantinos Kyzirakos; Oscar Martinez-Rubi; Hannes Mühleisen
Earth observation sciences produce large data sets that are inherently rich in spatial and geospatial information. Together with live data collected from monitoring systems and large collections of semantically rich objects, they provide new opportunities for advanced eScience research on climatology, urban planning, and smart cities. Such a combination of heterogeneous data sets forms a new source of knowledge, and extracting that knowledge efficiently is an eScience challenge. It requires efficient bulk data injection from both static and streaming data sources, dynamic adaptation of the physical and logical schema, efficient methods to correlate spatial and temporal data, and the flexibility to (re-)formulate the research question at any time. In this work, we present a data management layer over a column-oriented relational data management system that provides efficient analysis of spatiotemporal data. It offers fast data ingestion through different data loaders, tabular and array-based storage, and dynamic, step-wise exploration.
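As one concrete example of loader-driven bulk ingestion of this kind, MonetDB's COPY INTO performs a server-side bulk load from a file; the snippet below is a minimal, hypothetical setup (path, schema, and delimiter escaping vary by deployment and MonetDB version).

```python
# Minimal sketch of bulk CSV ingestion into MonetDB via COPY INTO.
# Path, table schema, and delimiter escaping are illustrative.
import pymonetdb

conn = pymonetdb.connect(database="escience", hostname="localhost",
                         username="monetdb", password="monetdb")
cur = conn.cursor()
cur.execute("CREATE TABLE obs (lon DOUBLE, lat DOUBLE, ts TIMESTAMP, value DOUBLE)")
# Server-side bulk load: the file must be readable by the server process.
cur.execute(r"""
    COPY INTO obs FROM '/data/observations.csv'
    USING DELIMITERS ',', E'\n', '"'
""")
conn.commit()
```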
Statistical and Scientific Database Management | 2014
Jonathan Lajus; Hannes Mühleisen
Statistical analysts have long been struggling with ever-growing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
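The zero-copy idea can be demonstrated in miniature: two "systems" interpret the very same C-style array without any serialization or copying in between. Here NumPy's frombuffer stands in for R wrapping a pointer handed over by the database.

```python
# Toy demonstration of zero-copy sharing: one buffer, two views, no copies.
import numpy as np

column = bytearray(np.arange(5, dtype=np.float64).tobytes())  # "database" storage
view = np.frombuffer(column, dtype=np.float64)                # "analysis" view, no copy

column[0:8] = np.float64(99.0).tobytes()   # the database updates its column...
print(view[0])                             # ...and the analyst sees 99.0 instantly
```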
International Conference on Management of Data | 2018
Mark Raasveldt; Pedro Holanda; Tim Gubner; Hannes Mühleisen
Performance benchmarking is one of the most commonly used methods for comparing different systems or algorithms, both in scientific literature and in industrial publications. While performance measurements might seem objective on the surface, there are many different ways to influence benchmark results to favor one system over the other, either by accident or on purpose. In this paper, we perform a study of the common pitfalls in DBMS performance comparisons, and give advice on how they can be spotted and avoided so a fair performance comparison between systems can be made. We illustrate the common pitfalls with a series of mock benchmarks, which show large differences in performance where none should be present.
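A minimal harness guarding against two of the pitfalls discussed in this line of work, cold-cache measurements and single-run numbers: warm up first, repeat, and report the median and spread rather than a single figure. The workload here is a placeholder.

```python
# Minimal fair-benchmarking harness: warm-up runs plus repeated timed runs,
# reporting median and spread instead of a single (possibly lucky) number.
import statistics
import time

def benchmark(run_query, warmup=3, repetitions=10):
    for _ in range(warmup):              # warm caches, JITs, buffer pools
        run_query()
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings), max(timings)

median, lo, hi = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median {median:.4f}s  (min {lo:.4f}s, max {hi:.4f}s)")
```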
Statistical and Scientific Database Management | 2017
Till Döhmen; Hannes Mühleisen; Peter A. Boncz
Comma-Separated Values (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in its parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented, multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc., and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are handled by searching the hypothesis space rather than by programming the many interactions in code. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
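A miniature version of the ranking idea: parse the same input under several dialect hypotheses and keep the parse whose resulting table scores best on simple quality features. The paper's system generates and ranks far richer hypotheses (table structure, types, and more); the scoring function below is a deliberately crude stand-in.

```python
# Miniature multi-hypothesis CSV parsing: try several dialects, score the
# resulting tables, and keep the best-scoring parse.
import csv
import io

def table_quality(rows):
    if len(rows) < 2:
        return 0.0
    widths = {len(r) for r in rows}
    if len(widths) != 1 or widths == {1}:    # ragged or degenerate single column
        return 0.0
    cells = [c for r in rows for c in r]
    def looks_typed(c):
        c = c.strip()
        try:
            float(c)
            return True
        except ValueError:
            return len(c) > 0                # non-empty text also counts
    return sum(looks_typed(c) for c in cells) / len(cells)

def best_parse(text):
    hypotheses = [{"delimiter": d, "quotechar": q}
                  for d in [",", ";", "\t", "|"] for q in ['"', "'"]]
    scored = []
    for h in hypotheses:
        rows = list(csv.reader(io.StringIO(text), **h))
        scored.append((table_quality(rows), h, rows))
    return max(scored, key=lambda s: s[0])

score, dialect, rows = best_parse("a;b;c\n1;2;3\n4;5;6\n")
print(dialect, rows[1])   # the ';' hypothesis wins: ['1', '2', '3']
```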