Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Matteo Interlandi is active.

Publication


Featured research published by Matteo Interlandi.


Very Large Data Bases | 2015

Titian: data provenance support in Spark

Matteo Interlandi; Kshitij Shah; Sai Deep Tetali; Muhammad Ali Gulzar; Seunghyun Yoo; Miryung Kim; Todd D. Millstein; Tyson Condie

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders of magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
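
Titian's actual interface is not reproduced here; as a rough illustration of what record-level provenance buys you, the following PySpark sketch manually tags each input record with an identifier and carries the tags through a small pipeline, so a suspicious output can be traced back to the inputs that produced it. All names and data are illustrative.

```python
# Illustrative sketch only: manual record-level lineage in plain PySpark.
# Titian provides this transparently; here we tag records by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext

lines = ["error disk full", "ok", "error timeout"]
# Tag every input record with its position so it can be traced later.
tagged = sc.parallelize(lines).zipWithIndex()            # (line, input_id)

# Propagate the input_id through a word-count-style pipeline.
words = tagged.flatMap(lambda p: [(w, p[1]) for w in p[0].split()])
pairs = words.map(lambda p: (p[0], (1, {p[1]})))         # (word, (count, ids))
counts = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] | b[1]))

# Trace a suspicious output back to the inputs that produced it.
suspect = counts.filter(lambda kv: kv[0] == "error").collect()
print(suspect)   # [('error', (2, {0, 2}))] -> came from input records 0 and 2
```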


International Conference on Software Engineering | 2016

BigDebug: debugging primitives for interactive big data processing in spark

Muhammad Ali Gulzar; Matteo Interlandi; Seunghyun Yoo; Sai Deep Tetali; Tyson Condie; Todd D. Millstein; Miryung Kim

Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today’s datacenters is time-consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next-generation data-intensive scalable cloud computing platform. This requires re-thinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time-consuming for an end user. First, BigDebug’s simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BigDebug scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BigDebug supports debugging at interactive speeds with minimal performance impact.
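
The primitives above are implemented inside Spark itself; the sketch below only approximates the idea of an on-demand watchpoint in plain PySpark, pulling back a bounded sample of the intermediate records that satisfy a guard predicate rather than the full dataset. The guard and the data are made up for illustration.

```python
# Rough approximation of an on-demand watchpoint in plain PySpark:
# only records matching a guard predicate are shipped to the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("watchpoint-sketch").getOrCreate()
sc = spark.sparkContext

records = sc.parallelize(range(1_000_000))
intermediate = records.map(lambda x: x * 3 % 7)          # some transformation

# Hypothetical guard: flag values the developer considers suspicious.
def guard(v):
    return v == 0

# Instead of collecting millions of records, retrieve only guarded ones,
# and only a bounded sample of those.
suspicious_sample = intermediate.filter(guard).take(10)
print(suspicious_sample)
```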


Information Systems | 2016

Combining user and database perspective for solving keyword queries over relational databases

Sonia Bergamaschi; Francesco Guerra; Matteo Interlandi; Raquel Trillo-Lado; Yannis Velegrakis

Over the last decade, keyword search over relational data has attracted considerable attention. A possible approach to this problem is to transform keyword queries into one or more SQL queries to be executed by the relational DBMS. Finding these queries is a challenging task, since the information they represent may be modeled across different tables and attributes. This means identifying not only the schema elements where the data of interest is stored, but also how these elements are interconnected. All the approaches proposed so far provide a monolithic solution. In this work, we instead divide the problem into three steps: the first, driven by the user's point of view, takes into account what the user has in mind when formulating keyword queries; the second, driven by the database perspective, considers how the data is represented in the database schema; finally, the third step combines these two processes. We present the theory behind our approach and its implementation in a system called QUEST (QUEry generator for STructured sources), which has been extensively tested to show the efficiency and effectiveness of our approach. Furthermore, we report on the outcomes of a number of experiments that we have conducted.
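
The following toy sketch is not QUEST's algorithm; it merely illustrates the task the paper addresses, namely binding keywords to schema elements (here, author.name and paper.title) and assembling a join query that connects them, using a throwaway SQLite schema.

```python
# Toy illustration of keyword search over relational data (not QUEST itself):
# keywords are bound to schema elements, then a join query connects them.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper(id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Interlandi');
    INSERT INTO paper VALUES (10, 'Titian', 1), (11, 'BigDebug', 1);
""")

# Suppose the keyword query is "Interlandi Titian": the first keyword binds to
# author.name, the second to paper.title. The schema graph tells us the two
# tables connect through paper.author_id -> author.id, yielding this SQL.
sql = """
    SELECT paper.title, author.name
    FROM paper JOIN author ON paper.author_id = author.id
    WHERE author.name = ? AND paper.title = ?
"""
print(con.execute(sql, ("Interlandi", "Titian")).fetchall())
# [('Titian', 'Interlandi')]
```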


Symposium on Cloud Computing | 2016

Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale

Muhuan Huang; Di Wu; Cody Hao Yu; Zhenman Fang; Matteo Interlandi; Tyson Condie; Jason Cong

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems---like Apache Spark and Hadoop---to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort needed to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
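
Blaze's actual API is not shown here; the snippet below is a hypothetical, pure-Python sketch of the "accelerator as a service" idea: clients request a named accelerator from a shared registry and transparently fall back to a CPU implementation when none is available.

```python
# Hypothetical sketch of the accelerator-as-a-service idea (not Blaze's API):
# applications request a named accelerator and transparently fall back to CPU.
from typing import Callable, Dict, List

class AcceleratorService:
    """Registry that shares named accelerator implementations among clients."""
    def __init__(self) -> None:
        self._impls: Dict[str, Callable[[List[float]], List[float]]] = {}

    def register(self, name: str, impl: Callable[[List[float]], List[float]]) -> None:
        self._impls[name] = impl

    def run(self, name: str, data: List[float],
            cpu_fallback: Callable[[List[float]], List[float]]) -> List[float]:
        # Use the accelerator if one is registered, otherwise run on CPU.
        return self._impls.get(name, cpu_fallback)(data)

faas = AcceleratorService()
# An FPGA kernel would be registered here; we fake it with a Python function.
faas.register("scale2x", lambda xs: [2.0 * x for x in xs])

print(faas.run("scale2x", [1.0, 2.0], cpu_fallback=lambda xs: [2.0 * x for x in xs]))
# [2.0, 4.0]
```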


International Conference on Management of Data | 2016

Big Data Analytics with Datalog Queries on Spark

Alexander Shkapsky; Mohan Yang; Matteo Interlandi; Hsuan Chiu; Tyson Condie; Carlo Zaniolo

There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
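
The recursive queries BigDatalog compiles ultimately reduce to iterative fixpoint computations over Spark. As a concrete, hand-written (non-BigDatalog) example, transitive closure can be evaluated semi-naively with plain RDD operations:

```python
# Hand-written semi-naive transitive closure on Spark RDDs, the kind of
# fixpoint computation a recursive Datalog query boils down to.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tc-sketch").getOrCreate()
sc = spark.sparkContext

edges = sc.parallelize([(1, 2), (2, 3), (3, 4)]).cache()
closure = edges                      # TC(x, y) <- edge(x, y).
delta = edges                        # newly derived facts

while True:
    # TC(x, z) <- delta(x, y), edge(y, z): join on the shared variable y.
    new_facts = (delta.map(lambda e: (e[1], e[0]))       # key delta by y
                      .join(edges)                       # (y, (x, z))
                      .map(lambda kv: (kv[1][0], kv[1][1]))
                      .subtract(closure)
                      .distinct()
                      .cache())
    if new_facts.isEmpty():
        break
    closure = closure.union(new_facts)
    delta = new_facts

print(sorted(closure.collect()))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```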


Foundations of Software Engineering | 2016

BigDebug: interactive debugger for big data analytics in Apache Spark

Muhammad Ali Gulzar; Matteo Interlandi; Tyson Condie; Miryung Kim

To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Apache Hadoop, and Apache Spark. In terms of debugging, DISC systems support post-mortem log analysis but do not provide interactive debugging features in real time. This tool demonstration paper showcases a set of concrete use cases on how BigDebug can help debug Big Data applications by providing interactive, real-time debugging primitives. To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints to enable a user to inspect a program without actually pausing the entire computation. To minimize unnecessary communication and data transfer, BigDebug provides on-demand watchpoints that enable a user to retrieve intermediate data using a guard and transfer the selected data on demand. To support systematic and efficient trial-and-error debugging, BigDebug also enables users to change program logic in response to an error at runtime and replay the execution from that step. BigDebug is available for download at http://web.cs.ucla.edu/~miryung/software.html
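
As a rough, plain-Spark approximation of a simulated breakpoint (not BigDebug's API), one can cache the intermediate stage of interest so it can be inspected lazily, on demand, without pausing the running computation:

```python
# Plain-Spark approximation of a "simulated breakpoint" (not BigDebug's API):
# keep the intermediate stage cached so it can be re-inspected on demand
# without re-running the whole pipeline or pausing anything.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("breakpoint-sketch").getOrCreate()
sc = spark.sparkContext

raw = sc.parallelize(["3", "7", "oops", "11"])
parsed = raw.map(lambda s: s if s.isdigit() else None).cache()   # "breakpoint"

# Downstream computation keeps running; nothing is paused.
total = parsed.filter(lambda v: v is not None).map(int).sum()
print(total)                       # 21

# Later, inspect the cached stage on demand instead of replaying from scratch.
bad_records = parsed.filter(lambda v: v is None).count()
print(bad_records)                 # 1
```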


Theory and Practice of Logic Programming | 2017

Fixpoint semantics and optimization of recursive Datalog programs with aggregates

Carlo Zaniolo; Mohan Yang; Ariyam Das; Alexander Shkapsky; Tyson Condie; Matteo Interlandi

A very desirable Datalog extension investigated by many researchers in the last thirty years consists in allowing the use of the basic SQL aggregates min, max, count and sum in recursive rules. In this paper, we propose a simple comprehensive solution that extends the declarative least-fixpoint semantics of Horn Clauses, along with the optimization techniques used in the bottom-up implementation approach adopted by many Datalog systems. We start by identifying a large class of programs of great practical interest in which the use of min or max in recursive rules does not compromise the declarative fixpoint semantics of the programs using those rules. Then, we revisit the monotonic versions of the count and sum aggregates proposed in (Mazuran et al. 2013b) and named, respectively, mcount and msum. Since mcount, and also msum on positive numbers, are monotonic in the lattice of set-containment, they preserve the fixpoint semantics of Horn Clauses. However, in many applications of practical interest, their use can lead to inefficiencies that can be eliminated by combining them with max, whereby mcount and msum become the standard count and sum. Therefore, the semantics and optimization techniques of Datalog are extended to recursive programs with min, max, count and sum, making possible the advanced applications of superior performance and scalability demonstrated by BigDatalog (Shkapsky et al. 2016) and Datalog-MC (Yang et al. 2017). This paper is under consideration for acceptance in TPLP.
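
The formal semantics is in the paper; the small Python sketch below only illustrates why min inside recursion is well behaved: iterating shortest-path relaxation only ever lowers distances, so the computation converges to a fixpoint. The graph is made up.

```python
# Small illustration of min-in-recursion: iterate shortest-path relaxation
# to a fixpoint. Each step only lowers distances, so iteration converges.
from math import inf

edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 10.0}
nodes = {n for e in edges for n in e}
dist = {n: (0.0 if n == "a" else inf) for n in nodes}

changed = True
while changed:                      # fixpoint loop
    changed = False
    for (u, v), w in edges.items():
        if dist[u] + w < dist[v]:   # min aggregate over incoming paths
            dist[v] = dist[u] + w
            changed = True

print(dist)                         # a: 0.0, b: 1.0, c: 3.0
```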


Symposium on Cloud Computing | 2017

Automated debugging in data-intensive scalable computing

Muhammad Ali Gulzar; Matteo Interlandi; Xueyuan Han; Mingda Li; Tyson Condie; Miryung Kim

Developing Big Data Analytics workloads often involves trial and error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crash, outlier results, etc.) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localizability by several orders of magnitude (∼10³ to 10⁷×) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time.
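
BigSift's optimizations and provenance integration are not reproduced here; the sketch below is a crude illustration of the delta-debugging side of the idea: given a user-supplied test oracle, greedily shrink the input to a smaller subset that still reproduces the failure. The oracle and data are hypothetical.

```python
# Crude illustration of delta-debugging-style fault localization (not BigSift):
# shrink the input to a smaller subset that still makes the test oracle fail.
from typing import Callable, List

def minimize(records: List[str], failing: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop halves of the input while the failure is still reproduced."""
    assert failing(records)
    changed = True
    while changed and len(records) > 1:
        changed = False
        half = len(records) // 2
        for part in (records[:half], records[half:]):
            if failing(part):
                records = part
                changed = True
                break
    return records

# Hypothetical oracle: the job "fails" whenever a negative value is present.
oracle = lambda recs: any(r.startswith("-") for r in recs)
data = ["12", "7", "-3", "40", "5", "9"]
print(minimize(data, oracle))       # ['-3']
```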


International Conference on Management of Data | 2017

Debugging Big Data Analytics in Spark with BigDebug

Muhammad Ali Gulzar; Matteo Interlandi; Tyson Condie; Miryung Kim

To process massive quantities of data, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Apache Spark. In terms of debugging, DISC systems support only post-mortem log analysis and do not provide any debugging functionality. This demonstration paper showcases BigDebug: a tool enhancing Apache Spark with a set of interactive debugging features that can help users debug their Big Data applications.


The VLDB Journal | 2018

Adding data provenance support to Apache Spark

Matteo Interlandi; Ari Ekmekji; Kshitij Shah; Muhammad Ali Gulzar; Sai Deep Tetali; Miryung Kim; Todd D. Millstein; Tyson Condie

Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance—tracking data through transformations—in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds—orders of magnitude faster than alternative solutions—while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

Collaboration


Dive into Matteo Interlandi's collaborations.

Top Co-Authors

Tyson Condie, University of California
Miryung Kim, University of California
Carlo Zaniolo, University of California
Mohan Yang, University of California
Ariyam Das, University of California
Yunseong Lee, Seoul National University