Publications


Featured research published by Carsten Binnig.


Very Large Data Bases | 2016

The end of slow networks: it's time for a redesign

Carsten Binnig; Andrew Crotty; Alex Galakatos; Tim Kraska; Erfan Zamanian

The next generation of high-performance networks with remote direct memory access (RDMA) capabilities requires a fundamental rethinking of the design of distributed in-memory DBMSs. These systems are commonly built under the assumption that the network is the primary bottleneck and should be avoided at all costs, but this assumption no longer holds. For instance, with InfiniBand FDR 4×, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel. Moreover, RDMA transfer latencies continue to rapidly improve as well. In this paper, we first argue that traditional distributed DBMS architectures cannot take full advantage of high-performance networks and suggest a new architecture to address this problem. Then, we discuss initial results from a prototype implementation of our proposed architecture for OLTP and OLAP, showing remarkable performance improvements over existing designs.
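
To make the "same ballpark" claim concrete, here is a back-of-the-envelope comparison using standard datasheet figures (our numbers for illustration, not taken from the paper): InfiniBand FDR runs four lanes at 14.0625 Gb/s each with 64b/66b encoding, while a single 64-bit DDR3-1600 memory channel peaks at 12.8 GB/s.

```python
# Back-of-the-envelope bandwidth comparison. The figures are standard
# datasheet values (assumptions for illustration, not from the paper).

FDR_LANE_GBPS = 14.0625      # InfiniBand FDR signaling rate per lane (Gb/s)
LANES = 4                    # FDR 4x
ENCODING = 64 / 66           # 64b/66b line encoding overhead

network_gb_per_s = FDR_LANE_GBPS * LANES * ENCODING / 8   # ~6.8 GB/s
memory_gb_per_s = 1600e6 * 8 / 1e9                        # DDR3-1600, 64-bit channel

print(f"FDR 4x network   : {network_gb_per_s:.1f} GB/s")
print(f"DDR3-1600 channel: {memory_gb_per_s:.1f} GB/s")
print(f"ratio            : {memory_gb_per_s / network_gb_per_s:.1f}x")
```

One memory channel delivers less than twice the network's data rate, so treating the network as orders of magnitude slower than memory is no longer justified.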


Very Large Data Bases | 2015

Vizdom: interactive analytics through pen and touch

Andrew Crotty; Alex Galakatos; Emanuel Zgraggen; Carsten Binnig; Tim Kraska

Machine learning (ML) and advanced statistics are important tools for drawing insights from large datasets. However, these techniques often require human intervention to steer computation towards meaningful results. In this demo, we present Vizdom, a new system for interactive analytics through pen and touch. Vizdom's frontend allows users to visually compose complex workflows of ML and statistics operators on an interactive whiteboard, and the backend leverages recent advances in workflow compilation techniques to run these computations at interactive speeds. Additionally, we are exploring approximation techniques for quickly visualizing partial results that incrementally refine over time. This demo will show Vizdom's capabilities by allowing users to interactively build complex analytics workflows using real-world datasets.
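
A minimal sketch of the partial-result idea: aggregate a stream in chunks and emit a running estimate after each chunk, so the frontend can render an answer that refines over time. The function and chunk size are hypothetical; this is not Vizdom's actual engine.

```python
import random

def progressive_mean(stream, chunk_size=10_000):
    """Yield a running estimate of the mean after every chunk, so a
    visualization can show a partial answer that refines over time.
    (Illustrative sketch only, not Vizdom's implementation.)"""
    total, count, chunk = 0.0, 0, []
    for value in stream:
        chunk.append(value)
        if len(chunk) == chunk_size:
            total, count = total + sum(chunk), count + len(chunk)
            chunk.clear()
            yield total / count          # partial result for the frontend
    if chunk:                            # leftover values in the last chunk
        total, count = total + sum(chunk), count + len(chunk)
        yield total / count

data = (random.gauss(50, 10) for _ in range(100_000))
for estimate in progressive_mean(data):
    pass                                 # each estimate could update the UI
print(f"final estimate: {estimate:.2f}")
```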


Very Large Data Bases | 2015

An architecture for compiling UDF-centric workflows

Andrew Crotty; Alex Galakatos; Kayhan Dursun; Tim Kraska; Carsten Binnig; Ugur Çetintemel; Stan Zdonik

Data analytics has recently grown to include increasingly sophisticated techniques, such as machine learning and advanced statistics. Users frequently express these complex analytics tasks as workflows of user-defined functions (UDFs) that specify each algorithmic step. However, given typical hardware configurations and dataset sizes, the core challenge of complex analytics is no longer sheer data volume but rather the computation itself, and the next generation of analytics frameworks must focus on optimizing for this computation bottleneck. While query compilation has gained widespread popularity as a way to tackle the computation bottleneck for traditional SQL workloads, relatively little work addresses UDF-centric workflows in the domain of complex analytics. In this paper, we describe a novel architecture for automatically compiling workflows of UDFs. We also propose several optimizations that consider properties of the data, UDFs, and hardware together in order to generate different code on a case-by-case basis. To evaluate our approach, we implemented these techniques in Tupleware, a new high-performance distributed analytics system, and our benchmarks show performance improvements of up to three orders of magnitude compared to alternative systems.
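
The heart of such an architecture is operator fusion: instead of materializing an intermediate collection between every UDF, the compiler chains the per-tuple UDFs into one function and makes a single pass over the data. A minimal sketch with hypothetical UDFs (not Tupleware's actual code generator, which emits low-level code):

```python
def compile_pipeline(udfs):
    """Fuse a chain of per-tuple UDFs into a single function so the
    workflow runs in one pass with no intermediate collections.
    (Illustrative sketch, not Tupleware's code generator.)"""
    def fused(record):
        for udf in udfs:
            record = udf(record)
        return record
    return fused

# Hypothetical workflow: parse -> normalize -> square
pipeline = compile_pipeline([
    lambda s: float(s),
    lambda x: (x - 50.0) / 10.0,
    lambda z: z * z,
])
print(sum(map(pipeline, ["40", "50", "60"])))  # single pass over the input
```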


European Semantic Web Conference | 2015

RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration

Christoph Pinkel; Carsten Binnig; Ernesto Jiménez-Ruiz; Wolfgang May; Dominique Ritze; Martin G. Skjæveland; Alessandro Solimando; Evgeny Kharlamov

A major challenge in information management today is the integration of huge amounts of data distributed across multiple data sources. A suggested approach to this problem is ontology-based data integration, where legacy data systems are integrated via a common ontology that represents a unified global view over all data sources. However, data is often not natively born using these ontologies. Instead, much data resides in legacy relational databases. Therefore, mappings that relate the legacy relational data sources to the ontology need to be constructed. Recently, techniques and systems have been developed that automatically construct such mappings. The quality of these systems, however, has mostly been evaluated on self-designed benchmarks. This paper introduces a new publicly available benchmarking suite called RODI, which is designed to cover a wide range of mapping challenges in Relational-to-Ontology Data Integration scenarios. RODI provides a set of relational data sources and ontologies representing a wide range of mapping challenges, as well as a scoring function with which the performance of relational-to-ontology mapping construction systems can be evaluated.
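
As a simplified stand-in for RODI's scoring function (the suite actually evaluates the results of test queries run through the generated mappings against a reference), a set-based F-measure over query answers could look like this:

```python
def f_score(produced, expected):
    """Set-based precision/recall/F1 over query answers; a simplified
    stand-in for RODI's scoring function."""
    produced, expected = set(produced), set(expected)
    if not produced or not expected:
        return 0.0
    tp = len(produced & expected)
    precision, recall = tp / len(produced), tp / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_score({"alice", "bob", "carol"}, {"alice", "bob", "dave"}))  # ~0.67
```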


European Semantic Web Conference | 2014

How to Best Find a Partner? An Evaluation of Editing Approaches to Construct R2RML Mappings

Christoph Pinkel; Carsten Binnig; Peter Haase; Clemens Martin; Kunal Sengupta; Johannes Trame

R2RML defines a language to express mappings from relational data to RDF. That way, applications built on top of the W3C Semantic Technology stack can seamlessly integrate relational data. A major obstacle to using R2RML, though, is the effort of manually curating the mappings. In particular, in scenarios that aim to map data from huge and complex relational schemata (e.g., [5]) to more abstract ontologies, efficient ways to support mapping creation are needed.
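
Conceptually, an R2RML mapping turns each row of a logical table into RDF triples through a subject URI template plus column-to-predicate entries. A toy sketch with a hypothetical schema (a real processor would parse the Turtle-based R2RML vocabulary rather than a Python dict):

```python
# Toy illustration of what an R2RML mapping expresses; the schema and
# example URIs are hypothetical.
mapping = {
    "subject_template": "http://example.org/person/{id}",
    "predicate_object": {
        "http://xmlns.com/foaf/0.1/name": "name",
        "http://xmlns.com/foaf/0.1/mbox": "email",
    },
}

rows = [{"id": 7, "name": "Ada", "email": "ada@example.org"}]

for row in rows:
    subject = mapping["subject_template"].format(**row)
    for predicate, column in mapping["predicate_object"].items():
        print(f'<{subject}> <{predicate}> "{row[column]}" .')
```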


International Conference on Management of Data | 2015

Locality-aware Partitioning in Parallel Database Systems

Erfan Zamanian; Carsten Binnig; Abdallah Salama

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and database schema. A common technique to reduce network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in this respect, since only subsets of tables in complex schemata that share the same join key can be co-partitioned, unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partitioning (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data locality. One algorithm needs only the schema and data, whereas the other additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
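
A minimal sketch of the reference-partitioning idea for an equi-join predicate: hash-partition a seed table on its key, then place each row of the referencing table into every partition holding a join partner, replicating rows where necessary. The tables here are hypothetical, and the actual PREF scheme is considerably more general:

```python
def hash_partition(rows, key, n_parts):
    """Seed step: hash-partition a table on its key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def reference_partition(rows, pred, seed_parts):
    """PREF-style step (simplified): put each row into every partition
    that contains a join partner, so the join becomes purely local.
    Rows may be replicated across partitions."""
    parts = [[] for _ in seed_parts]
    for row in rows:
        for i, seed in enumerate(seed_parts):
            if any(pred(row, s) for s in seed):
                parts[i].append(row)
    return parts

customers = [{"ckey": k} for k in range(8)]
orders = [{"okey": n, "ckey": n % 8} for n in range(20)]

cust_parts = hash_partition(customers, "ckey", 4)
order_parts = reference_partition(
    orders, lambda o, c: o["ckey"] == c["ckey"], cust_parts)
print([len(p) for p in order_parts])  # each order co-located with its customer
```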


Very Large Data Bases | 2017

The end of a myth: distributed transactions can scale

Erfan Zamanian; Carsten Binnig; Tim Harris; Tim Kraska

The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? Developers would no longer need to worry about co-partitioning schemes to achieve decent performance. Application development would become easier, as data placement would no longer determine how scalable an application is. Hardware provisioning would be simplified, as the system administrator could expect a linear scale-out when adding more machines rather than some complex sub-linear function that is highly application-specific. In this paper, we present the design of our novel scalable distributed database system NAM-DB and show that distributed transactions with the very common Snapshot Isolation guarantee can indeed scale using the next generation of RDMA-enabled network technology, without any inherent bottlenecks. Our experiments with the TPC-C benchmark show that our system scales linearly to over 6.5 million distributed transactions per second on 56 machines.
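
For intuition, here is a minimal multi-version snapshot-isolation sketch: transactions read as of a snapshot timestamp and abort on write-write conflicts. It is purely illustrative; NAM-DB realizes these steps with one-sided RDMA reads and writes and a scalable timestamp oracle.

```python
class Store:
    """Toy multi-version store with snapshot isolation (illustrative
    only; not NAM-DB's RDMA-based implementation)."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0       # stand-in for the global timestamp oracle

    def begin(self):
        return self.clock    # snapshot = latest committed timestamp

    def read(self, key, read_ts):
        # Newest version that is visible in the snapshot.
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return max(visible)[1] if visible else None

    def commit(self, writes, read_ts):
        # Abort if any written key was committed after our snapshot.
        for key in writes:
            if any(ts > read_ts for ts, _ in self.versions.get(key, [])):
                return False
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return True

db = Store()
t1 = db.begin()
assert db.commit({"x": 1}, t1)
print(db.read("x", db.begin()))  # -> 1
```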


International Conference on Management of Data | 2016

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Yeounoh Chung; Michael L. Mortensen; Carsten Binnig; Tim Kraska

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
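
The overlap idea is closely related to species estimation in statistics. One classic instantiation (a plausible sketch, not necessarily the paper's exact estimator) is the Chao1 formula, which predicts the number of unseen items from the counts of items observed exactly once (f1) and exactly twice (f2):

```python
from collections import Counter

def chao1_unseen(observations):
    """Chao1 species estimator: ~f1^2 / (2 * f2) unseen items, where f1
    and f2 count items observed exactly once or twice across sources.
    A classic stand-in for the paper's overlap-based estimation idea."""
    freq = Counter(Counter(observations).values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    return f1 * f1 / (2 * f2) if f2 else float(f1)

# Items collected from several overlapping sources (duplicates matter).
sightings = ["a", "a", "b", "c", "c", "d", "e", "e", "f", "g"]
print(chao1_unseen(sightings))  # many singletons -> many unseen items (~2.7)
```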


International Conference on Management of Data | 2015

Cost-based Fault-tolerance for Parallel Data Processing

Abdallah Salama; Carsten Binnig; Tim Kraska; Erfan Zamanian

In order to deal with mid-query failures in parallel data engines (PDEs), different fault-tolerance schemes are implemented today: (1) fault-tolerance in parallel databases is typically implemented in a coarse-grained manner, by completely restarting a query when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes can efficiently handle mixed workloads of both short-running interactive queries and long-running batch queries, nor do they efficiently support the wide range of cluster setups that vary in cluster size and other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme that tackles this issue. Compared to existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
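
A toy version of the cost-based trade-off (the cost model here is hypothetical, not the paper's): estimate the expected runtime of a materialization plan under a given failure probability, and note that the best plan shifts with the failure rate.

```python
def expected_runtime(stage_costs, materialize, fail_prob):
    """Toy cost model: after each stage a failure strikes with
    probability fail_prob and re-runs all work since the last
    materialized intermediate (counted once). Materializing a stage
    adds 10% overhead. Hypothetical model, not the paper's."""
    runtime = since_checkpoint = 0.0
    for cost, mat in zip(stage_costs, materialize):
        cost = cost * 1.1 if mat else cost
        since_checkpoint += cost
        runtime += cost + fail_prob * since_checkpoint  # expected redo work
        if mat:
            since_checkpoint = 0.0
    return runtime

stages = [10, 10, 10, 10]
for p in (0.05, 0.2):
    for plan in [(False,) * 4, (False, True, False, True), (True,) * 4]:
        print(p, plan, round(expected_runtime(stages, plan, p), 2))
```

Under this toy model, materializing nothing wins at a 5% failure rate, while materializing every intermediate is cheapest at 20%, which is exactly why a fixed scheme cannot fit all cluster setups.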


International Conference on Management of Data | 2016

VisTrees: fast indexes for interactive data exploration

Muhammad El-Hindi; Zheguang Zhao; Carsten Binnig; Tim Kraska

Visualizations are arguably the most important tool to explore, understand and convey facts about data. As part of interactive data exploration, visualizations might be used to quickly skim through the data and look for patterns. Unfortunately, database systems are not designed to efficiently support these workloads. As a result, visualizations often take a long time to produce, creating a significant barrier to interactive data analysis. In this paper, we focus on the interactive computation of histograms for data exploration. To address this issue, we present a novel multi-dimensional index structure called VisTree. As a key contribution, this paper presents several techniques to better align the design of multi-dimensional indexes with the needs of visualization tools for data exploration. Our experiments show that the VisTree achieves speedups of up to three orders of magnitude compared to traditional multi-dimensional indexes and enables interactive response times below 500 ms, even on large datasets.
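
As a tiny stand-in for the indexing idea, the sketch below precomputes fine-grained bin counts once, so any display-resolution histogram over the column is aggregated from counts instead of rescanning the base data. The class is hypothetical and one-dimensional; the actual VisTree is a multi-dimensional structure:

```python
import bisect
import random

class BinIndex:
    """Precompute fine-grained per-bin counts so histograms render from
    counts, not raw data (1-D illustrative sketch, not the VisTree)."""

    def __init__(self, values, n_bins=64):
        lo, hi = min(values), max(values)
        self.edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
        self.counts = [0] * n_bins
        for v in values:
            i = min(bisect.bisect_right(self.edges, v) - 1, n_bins - 1)
            self.counts[i] += 1

    def histogram(self, n_buckets):
        """Aggregate the fine bins into n_buckets for display."""
        per = len(self.counts) // n_buckets
        return [sum(self.counts[i * per:(i + 1) * per]) for i in range(n_buckets)]

idx = BinIndex([random.gauss(0, 1) for _ in range(100_000)])
print(idx.histogram(8))  # instant histogram; no rescan of the base data
```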

Collaboration


Dive into Carsten Binnig's collaborations.

Top Co-Authors

Abdallah Salama

Baden-Württemberg Cooperative State University
