
Publication


Featured research published by Astrid Rheinländer.


Very Large Data Bases | 2014

The Stratosphere platform for big data analytics

Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Hoger; Kostas Tzoumas; Daniel Warneke

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture and its design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the coming years.
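The dataflow programming model described above can be pictured with a small, purely illustrative sketch. The snippet below is not Stratosphere's actual API (Stratosphere programs are written against its Java/Scala interfaces); the Dataflow class, its map and group_reduce operators, and the toy sequential execution engine are invented here solely to show what it means to treat user-defined functions as first-class citizens in a lazily assembled dataflow.

```python
# Purely illustrative sketch, not Stratosphere's API: UDFs (plain Python
# functions) are composed into a declarative plan that is only executed once
# the whole dataflow has been assembled.

from collections import defaultdict

class Dataflow:
    def __init__(self, records):
        self.plan = [("source", records)]

    def map(self, udf):                          # element-wise user-defined function
        self.plan.append(("map", udf))
        return self

    def group_reduce(self, key_fn, udf):         # user-defined aggregation per key group
        self.plan.append(("group_reduce", (key_fn, udf)))
        return self

    def execute(self):                           # stand-in for the parallel execution engine
        data = None
        for op, arg in self.plan:
            if op == "source":
                data = list(arg)
            elif op == "map":
                data = [arg(record) for record in data]
            elif op == "group_reduce":
                key_fn, udf = arg
                groups = defaultdict(list)
                for record in data:
                    groups[key_fn(record)].append(record)
                data = [udf(key, group) for key, group in groups.items()]
        return data

# Toy job: normalize keys, then aggregate per key -- both steps are UDFs.
purchases = [("alice", 3.0), ("bob", 2.5), ("alice", 1.5)]
totals = (Dataflow(purchases)
          .map(lambda r: (r[0].upper(), r[1]))
          .group_reduce(lambda r: r[0], lambda k, rs: (k, sum(a for _, a in rs)))
          .execute())
print(totals)   # [('ALICE', 4.5), ('BOB', 2.5)]
```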


Very Large Data Bases | 2012

Opening the black boxes in data flow optimization

Fabian Hueske; Mathias Peters; Matthias J. Sax; Astrid Rheinländer; Rico Bergmann; Aljoscha Krettek; Kostas Tzoumas

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black-box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding rewriting power similar to that of modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today's systems.
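To make the reordering idea concrete, the sketch below checks whether two operators with unknown semantics may be swapped, based on the attribute sets they read and write. The read/write sets are hand-annotated here and the conflict test is a simplified stand-in; the paper derives comparable properties automatically by static code analysis of the UDFs.

```python
# Hedged illustration of the reordering idea: two successive operators with
# unknown semantics can be swapped if their accessed attribute sets do not
# conflict. The read/write sets below are hand-annotated assumptions.

from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    reads: set = field(default_factory=set)    # attributes the UDF inspects
    writes: set = field(default_factory=set)   # attributes the UDF modifies or creates

def can_reorder(a: Operator, b: Operator) -> bool:
    """Conservative check: reordering is safe only if neither operator
    reads or writes an attribute that the other one writes."""
    return (a.writes.isdisjoint(b.reads | b.writes) and
            b.writes.isdisjoint(a.reads | a.writes))

# A cheap, selective filter can be pushed below an expensive annotator
# as long as the two UDFs touch disjoint attributes.
annotate = Operator("annotate_entities", reads={"text"}, writes={"entities"})
filter_lang = Operator("filter_english", reads={"lang"}, writes=set())

print(can_reorder(annotate, filter_lang))   # True: safe to evaluate the filter first
```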


Datenbank-Spektrum | 2012

Data Management Challenges in Next Generation Sequencing

Sebastian Wandelt; Astrid Rheinländer; Marc Bux; Lisa Thalheim; Berit Haldemann; Ulf Leser

Since the early days of the Human Genome Project, data management has been recognized as a key challenge for modern molecular biology research. By the end of the nineties, technologies had been established that adequately supported most ongoing projects, typically built upon relational database management systems. However, recent years have seen a dramatic increase in the amount of data produced by typical projects in this domain. While it took more than ten years, approximately three billion USD, and more than 200 groups worldwide to assemble the first human genome, today’s sequencing machines produce the same amount of raw data within a week, at a cost of approximately 2000 USD, and on a single device. Several national and international projects now deal with (tens of) thousands of genomes, and trends like personalized medicine call for efforts to sequence entire populations. In this paper, we highlight challenges that emerge from this flood of data, such as parallelization of algorithms, compression of genomic sequences, and cloud-based execution of complex scientific workflows. We also point to a number of further challenges that lie ahead due to the increasing demand for translational medicine, i.e., the accelerated transition of biomedical research results into medical practice.
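As a toy illustration of why genomic sequences are amenable to compression (a generic example, not a technique from the paper above): a DNA string over the four-letter alphabet {A, C, G, T} can be packed into two bits per base instead of one byte per character.

```python
# Toy illustration only: 2-bit packing of DNA, not a method from the paper.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq: str) -> bytes:
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    # Prepend the length so trailing bits are unambiguous when unpacking.
    return len(seq).to_bytes(4, "big") + bits.to_bytes((2 * len(seq) + 7) // 8, "big")

def unpack(blob: bytes) -> str:
    n = int.from_bytes(blob[:4], "big")
    bits = int.from_bytes(blob[4:], "big")
    return "".join(BASE[(bits >> (2 * (n - 1 - i))) & 0b11] for i in range(n))

seq = "ACGTACGTTTGA"
packed = pack(seq)
assert unpack(packed) == seq
print(len(seq), "bases ->", len(packed) - 4, "payload bytes")   # 12 bases -> 3 payload bytes
```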


North American Chapter of the Association for Computational Linguistics | 2009

Molecular event extraction from Link Grammar parse trees

Jörg Hakenberg; Illés Solt; Domonkos Tikk; Luis Tari; Astrid Rheinländer; Nguyen Quang Long; Graciela Gonzalez; Ulf Leser

We present an approach for extracting molecular events from literature based on a deep parser, using a query language for parse trees. Detected events range from gene expression to protein localization, and cover a multitude of different entity types, including genes/proteins, binding sites, and locations. Furthermore, our approach is capable of recognizing negation and the speculative character of extracted statements. We first parse documents using Link Grammar (BioLG) and store the parse trees in a database. Events are extracted using a newly developed query language that traverses the BioLG linkages between trigger terms, arguments, and events. The concrete queries are learnt from an annotated corpus. On BioNLP Shared Task data, we achieve an overall F1-measure of 29.6%.
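The query language over BioLG linkages is not reproduced here. The following hypothetical sketch only illustrates the underlying idea: a parse can be viewed as a graph of labeled links between tokens, and extracting an event amounts to finding a short chain of links from a trigger term to a candidate argument. The sentence, link labels, and helper names are invented for illustration.

```python
# Hypothetical sketch of the idea only: a (Bio)LG parse seen as a graph of
# labeled links, queried for a connection between a trigger and an argument.

from collections import deque

# Toy "parse": undirected labeled links between token indices of the sentence
# "p53 inhibits the expression of MDM2". Labels are invented.
tokens = ["p53", "inhibits", "the", "expression", "of", "MDM2"]
links = [(0, 1, "S"), (1, 3, "O"), (3, 4, "M"), (4, 5, "J")]

def linked_path(trigger: int, argument: int, max_len: int = 4):
    """Breadth-first search for a short chain of links between two tokens."""
    adj = {}
    for a, b, label in links:
        adj.setdefault(a, []).append((b, label))
        adj.setdefault(b, []).append((a, label))
    queue = deque([(trigger, [])])
    seen = {trigger}
    while queue:
        node, path = queue.popleft()
        if node == argument:
            return path
        if len(path) >= max_len:
            continue
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None

# Connect the trigger "expression" (index 3) to the protein "MDM2" (index 5).
print(tokens[3], "->", tokens[5], ":", linked_path(3, 5))   # expression -> MDM2 : ['M', 'J']
```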


Information Systems | 2015

SOFA: An extensible logical optimizer for UDF-heavy data flows

Astrid Rheinländer; Arvid Heise; Fabian Hueske; Ulf Leser; Felix Naumann

Recent years have seen an increased interest in large-scale analytical data flows on non-relational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: we arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of UDF-heavy data flows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six.
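The extensibility mechanism can be illustrated with a tiny, invented subsumption hierarchy: a newly registered UDF operator inherits the optimization-relevant properties of the most specific class that subsumes it. The class names and property names below are assumptions made for this sketch, not SOFA's actual taxonomy.

```python
# Hedged sketch of the extensibility idea: operators are arranged in a
# subsumption hierarchy, so a new UDF operator inherits the reordering-relevant
# properties of its ancestors. Hierarchy and property names are invented.

TAXONOMY = {
    # child            : (parent,            properties overridden at this level)
    "operator"         : (None,               {"reorderable": False}),
    "record_transform" : ("operator",         {"reorderable": True, "changes_cardinality": False}),
    "filter"           : ("record_transform", {"selective": True}),
    "entity_filter"    : ("filter",           {}),   # newly added UDF: inherits everything
}

def properties(op: str) -> dict:
    """Collect properties along the subsumption chain, most general first."""
    chain = []
    while op is not None:
        parent, props = TAXONOMY[op]
        chain.append(props)
        op = parent
    merged = {}
    for props in reversed(chain):   # more specific classes override general ones
        merged.update(props)
    return merged

print(properties("entity_filter"))
# {'reorderable': True, 'changes_cardinality': False, 'selective': True}
```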


Statistical and Scientific Database Management | 2010

Prefix tree indexing for similarity search and similarity joins on genomic data

Astrid Rheinländer; Martin Knobloch; Nicky Hochmuth; Ulf Leser

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix-tree-based indexing algorithm supporting approximate search and approximate joins. Our tool supports Hamming and edit distance as similarity measures and is available as a C++ library, as a Unix command-line tool, and as a cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster by orders of magnitude compared to the command-line tools. Compared to the RDBMS, it computes similarity joins in minutes for which UDFs did not finish within a day and outperforms the built-in join methods even in the exact case.
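The core idea of searching a prefix tree under edit distance can be sketched as follows: a dynamic-programming row is extended incrementally while descending the trie, and whole subtrees are pruned once the threshold can no longer be met. This is a minimal illustration under that assumption; PETER's compressed trie layout and pre-filtering techniques are not reproduced here.

```python
# Minimal sketch of edit-distance search over a prefix tree: the DP row is
# extended while descending the trie, and a branch is pruned as soon as its
# best possible distance exceeds the threshold k.

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = s          # mark the end of a stored string
    return root

def search(trie, query, k):
    """Return all stored strings within edit distance k of the query."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, row):
        if "$" in node and row[-1] <= k:
            results.append((node["$"], row[-1]))
        if min(row) > k:       # prune: no extension can get back under k
            return
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # substitution or match
            walk(child, new_row)

    walk(trie, first_row)
    return results

trie = build_trie(["ACGT", "ACCT", "AGGA", "TTTT"])
print(search(trie, "ACGT", 1))   # [('ACGT', 0), ('ACCT', 1)]
```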


Statistical and Scientific Database Management | 2012

Efficient similarity search in very large string sets

Dandy Fenz; Dustin Lange; Astrid Rheinländer; Felix Naumann; Ulf Leser

String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state-labeling strategy making the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
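The "trie as a nondeterministic finite automaton" view can be illustrated by carrying a set of active states (trie node, query position, edits used) while matching. This is a generic simulation sketch only; SSI's actual contribution, the compact state-labeling scheme, is not reproduced here.

```python
# Generic illustration of the trie-as-NFA view: approximate matching via a set
# of active states. Not SSI's state-labeling scheme.

class Node:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None

def build(strings):
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.word = s
    return root

def nfa_search(root, query, k):
    matches = set()
    active = {(root, 0, 0)}                  # (trie node, consumed query chars, edits)
    while active:
        next_active = set()
        for node, pos, edits in active:
            # Accept if a stored string ends here and the remaining query
            # characters fit into the leftover edit budget.
            if node.word is not None and edits + (len(query) - pos) <= k:
                matches.add(node.word)
            if pos < len(query) and edits < k:
                next_active.add((node, pos + 1, edits + 1))            # delete query char
            for ch, child in node.children.items():
                if pos < len(query) and ch == query[pos]:
                    next_active.add((child, pos + 1, edits))           # match
                elif edits < k:
                    next_active.add((child, pos, edits + 1))           # insert ch
                    if pos < len(query):
                        next_active.add((child, pos + 1, edits + 1))   # substitute
        active = next_active
    return matches

index = build(["mueller", "muller", "miller", "meyer"])
print(nfa_search(index, "muller", 1))   # {'mueller', 'muller', 'miller'} (set, order may vary)
```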


ACM Computing Surveys | 2017

Optimization of Complex Dataflows with User-Defined Functions

Astrid Rheinländer; Ulf Leser; Goetz Graefe

In many fields, recent years have brought a sharp rise in the size of the data to be analyzed and the complexity of the analysis to be performed. Such analyses are often described as dataflows specified in declarative dataflow languages. A key technique to achieve scalability for such analyses is the optimization of the declarative programs; however, many real-life dataflows are dominated by user-defined functions (UDFs) to perform, for instance, text analysis, graph traversal, classification, or clustering. This calls for specific optimization techniques as the semantics of such UDFs are unknown to the optimizer. In this article, we survey techniques for optimizing dataflows with UDFs. We consider methods developed over decades of research in relational database systems as well as more recent approaches spurred by the popularity of Map/Reduce-style data processing frameworks. We present techniques for syntactical dataflow modification, approaches for inferring semantics and rewrite options for UDFs, and methods for dataflow transformations both on the logical and the physical levels. Furthermore, we give a comprehensive overview of declarative dataflow languages for Big Data processing systems from the perspective of their built-in optimization techniques. Finally, we highlight open research challenges with the intention to foster more research into optimizing dataflows that contain UDFs.


Statistical and Scientific Database Management | 2016

PIEJoin: Towards Parallel Set Containment Joins

Anja Kunkel; Astrid Rheinländer; Christopher Schiefer; Sven Helmer; Panagiotis Bouros; Ulf Leser

The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, a number of questions remain open: an extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and of data properties on algorithm performance is largely unknown. Furthermore, all previous work considered only sequential join algorithms, although modern servers offer ample opportunities for parallelization. We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly optimized implementation of PIEJoin which uses tree signatures for saving space and interval labeling for improving the runtime of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin, already in its sequential form, clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly, yet not always, slower than the third, LIMIT+(opj), but requires significantly less space. The parallel version of PIEJoin presented here achieves significant further speed-ups, yet our evaluation also shows that further research is needed, as finding the best way of partitioning the join turns out to be non-trivial.
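PIEJoin's prefix-tree intersection with tree signatures and interval labels is not reproduced here. The sketch below only illustrates the semantics of a set containment join (find all pairs r ⊆ s) using a simple inverted-index approach in the spirit of the PRETTI baseline mentioned above.

```python
# Illustration of set-containment-join semantics via an inverted index
# (PRETTI-style baseline), not PIEJoin's prefix-tree intersection.

from collections import defaultdict

def containment_join(R, S):
    """Yield all (i, j) with R[i] a subset of S[j]."""
    # Inverted index: item -> ids of the sets in S that contain it.
    postings = defaultdict(set)
    for j, s in enumerate(S):
        for item in s:
            postings[item].add(j)

    for i, r in enumerate(R):
        if not r:                       # the empty set is contained in every set
            candidates = set(range(len(S)))
        else:
            # Intersect posting lists, starting with the rarest item.
            items = sorted(r, key=lambda x: len(postings[x]))
            candidates = set(postings[items[0]])
            for item in items[1:]:
                candidates &= postings[item]
                if not candidates:
                    break
        for j in candidates:
            yield (i, j)

R = [{1, 2}, {2, 5}, set()]
S = [{1, 2, 3}, {2, 4, 5}, {1, 2, 5}]
print(sorted(containment_join(R, S)))
# [(0, 0), (0, 2), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
```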


International Conference on Parallel Processing | 2011

Scalable sequence similarity search and join in main memory on multi-cores

Astrid Rheinländer; Ulf Leser

Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are held entirely in main memory. Parallelization of searches and joins is performed using MapReduce as the underlying execution paradigm. We show that our data structure is capable of performing many real-world sequence-comparison applications in main memory. Our evaluation reveals that PeARL achieves a significant performance gain compared to single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing the sequential parts of the algorithms.
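The parallelization pattern can be sketched as a map/reduce over index partitions: each partition is searched on its own core, and the per-partition hits are merged. In this illustration a naive edit-distance scan stands in for PeARL's compressed-trie index, and the partitioning and worker count are arbitrary; the code is not PeARL itself.

```python
# Sketch of the parallelization pattern only: "map" searches each partition in
# its own process, "reduce" merges the hits. A naive edit-distance scan stands
# in for PeARL's compressed-trie index.

from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_partition(args):
    partition, query, k = args
    return [s for s in partition if edit_distance(s, query) <= k]   # map step

def parallel_search(partitions, query, k, workers=2):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(search_partition, [(p, query, k) for p in partitions])
    return sorted(chain.from_iterable(hits))                        # reduce step

if __name__ == "__main__":
    strings = ["ACGT", "ACCT", "AGGT", "TTGA", "ACGA", "CCCC"]
    partitions = [strings[i::2] for i in range(2)]   # two partitions, one per core
    print(parallel_search(partitions, "ACGT", 1))    # ['ACCT', 'ACGA', 'ACGT', 'AGGT']
```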

Collaboration


Dive into Astrid Rheinländer's collaborations.

Top Co-Authors

Ulf Leser (Humboldt University of Berlin)
Arvid Heise (Hasso Plattner Institute)
Felix Naumann (Hasso Plattner Institute)
Anja Kunkel (Humboldt University of Berlin)
Fabian Hueske (Technical University of Berlin)
Marcus Leich (Technical University of Berlin)
Kostas Tzoumas (Technical University of Berlin)
Mathias Peters (Humboldt University of Berlin)
Matthias J. Sax (Humboldt University of Berlin)
Rico Bergmann (Humboldt University of Berlin)