
Publication


Featured research published by Carlo Curino.


Very Large Data Bases | 2010

Schism: a workload-driven approach to database replication and partitioning

Carlo Curino; Evan Philip Charles Jones; Yang Zhang; Samuel Madden

We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner). The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.
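
As a rough illustration of the first phase, the sketch below builds the co-access graph from a hypothetical transaction trace and counts the distributed transactions a candidate partitioning would incur, which is the quantity the graph partitioner minimizes. It uses only the Python standard library; the tuple identifiers and candidate layouts are invented for the example, and a real implementation would hand the weighted graph to a dedicated graph partitioner.

```python
from collections import Counter
from itertools import combinations

# Hypothetical OLTP trace: each transaction is the set of tuple ids it touches.
transactions = [
    {"cust:1", "order:7"},
    {"cust:1", "order:8"},
    {"cust:2", "order:9"},
    {"cust:1", "cust:2"},
]

# Phase-one input: one node per tuple, edge weight = number of transactions
# that access both endpoints.
edge_weights = Counter()
for txn in transactions:
    for a, b in combinations(sorted(txn), 2):
        edge_weights[(a, b)] += 1
print(edge_weights.most_common(3))

def distributed_txn_count(partition_of, txns):
    """Count transactions whose tuples span more than one partition --
    the quantity the graph partitioner tries to minimize."""
    return sum(len({partition_of[t] for t in txn}) > 1 for txn in txns)

# Two hypothetical candidate layouts over two partitions.
by_hash = {t: hash(t) % 2 for txn in transactions for t in txn}
by_customer = {"cust:1": 0, "order:7": 0, "order:8": 0,
               "cust:2": 1, "order:9": 1}

print("hash layout      :", distributed_txn_count(by_hash, transactions))
print("co-access layout :", distributed_txn_count(by_customer, transactions))
```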


International Conference on Management of Data | 2007

A data-oriented survey of context models

Carlo Curino; Elisa Quintarelli; Fabio A. Schreiber; Letizia Tanca

Context-aware systems are pervading everyday life; context modeling is therefore becoming a relevant issue and an expanding research field. This survey aims to provide a comprehensive evaluation framework that allows application designers to compare context models with respect to a given target application; in particular, we stress the analysis of those features that are relevant to the problem of data tailoring. The contribution of this paper is twofold: a general analysis framework for context models and an up-to-date comparison of the most interesting data-oriented approaches available in the literature.


International Conference on Management of Data | 2012

Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems

Andrew Pavlo; Carlo Curino; Stanley B. Zdonik

The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applications that scale without sacrificing ACID guarantees [7, 9]. The performance of these DBMSs is predicated on the existence of an optimal database design that is tailored for the unique characteristics of OLTP workloads. Deriving such designs for modern DBMSs is difficult, especially for enterprise-class OLTP systems, since they impose extra challenges: the use of stored procedures, the need for load balancing in the presence of time-varying skew, complex schemas, and deployments with a larger number of partitions. To this end, we present a novel approach to automatically partitioning databases for enterprise-class OLTP systems that significantly extends the state of the art by: (1) minimizing the number of distributed transactions, while concurrently mitigating the effects of temporal skew in both the data distribution and accesses, (2) extending the design space to include replicated secondary indexes, (3) organically handling stored procedure routing, and (4) scaling of schema complexity, data size, and number of partitions. This effort builds on two key technical contributions: an analytical cost model that can be used to quickly estimate the relative coordination cost and skew for a given workload and a candidate database design, and an informed exploration of the huge solution space based on large neighborhood search. To evaluate our methods, we integrated our database design tool with a high-performance parallel, main memory DBMS and compared our methods against both popular heuristics and a state-of-the-art research prototype [17]. Using a diverse set of benchmarks, we show that our approach improves throughput by up to a factor of 16x over these other approaches.
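
The cost model in the paper is more elaborate, but a toy version of its two ingredients, coordination cost (the fraction of distributed transactions) and a skew penalty over per-partition access counts, might look as follows; the function names, the blending weight alpha, and the sample numbers are illustrative assumptions, not the paper's actual formulation.

```python
from statistics import mean, pvariance

def coordination_cost(txn_partitions):
    """Fraction of sampled transactions that touch more than one partition."""
    return sum(len(parts) > 1 for parts in txn_partitions) / len(txn_partitions)

def skew_penalty(access_counts):
    """Normalized variance of per-partition access counts; 0 = perfectly balanced."""
    mu = mean(access_counts)
    return pvariance(access_counts) / (mu * mu) if mu else 0.0

def design_cost(txn_partitions, access_counts, alpha=0.5):
    # alpha blends the two terms; it is an illustrative knob, not the paper's weighting.
    return alpha * coordination_cost(txn_partitions) + (1 - alpha) * skew_penalty(access_counts)

# Hypothetical sample for one candidate design: the set of partitions each
# sampled transaction touched, plus total accesses observed per partition.
txns = [{0}, {0}, {0, 1}, {1}, {0, 1, 2}]
accesses_per_partition = [120, 80, 40]
print(round(design_cost(txns, accesses_per_partition), 3))
```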


Very Large Data Bases | 2008

Graceful database schema evolution: the PRISM workbench

Carlo Curino; Hyun Jin Moon; Carlo Zaniolo

Supporting graceful schema evolution represents an unsolved problem for traditional information systems that is further exacerbated in web information systems, such as Wikipedia and public scientific databases: in these projects, based on multiparty cooperation, the frequency of database schema changes has increased while tolerance for downtime has nearly disappeared. As of today, schema evolution remains an error-prone and time-consuming undertaking, because the DB Administrator (DBA) lacks the methods and tools needed to manage and automate this endeavor by (i) predicting and evaluating the effects of the proposed schema changes, (ii) rewriting queries and applications to operate on the new schema, and (iii) migrating the database. Our PRISM system takes a big first step toward addressing this pressing need by providing: (i) a language of Schema Modification Operators to concisely express complex schema changes, (ii) tools that allow the DBA to evaluate the effects of such changes, (iii) optimized translation of old queries to work on the new schema version, (iv) automatic data migration, and (v) full documentation of the changes applied, as needed to support data provenance, database flashback, and historical queries. PRISM solves these problems by integrating recent theoretical advances on mapping composition and invertibility into a design that also achieves usability and scalability. Wikipedia and its 170+ schema versions provided an invaluable testbed for validating PRISM tools and their ability to support legacy queries.
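
To make the idea of Schema Modification Operators concrete, here is a minimal sketch of how a couple of such operators could be represented and replayed over a schema in Python; the operator set, class names, and the naive query-rewrite step are simplified assumptions, not PRISM's actual SMO language.

```python
from dataclasses import dataclass

# A schema is modeled as {table name: list of column names}.

@dataclass
class RenameColumn:
    table: str
    old: str
    new: str
    def apply(self, schema):
        schema[self.table] = [self.new if c == self.old else c
                              for c in schema[self.table]]

@dataclass
class AddColumn:
    table: str
    column: str
    def apply(self, schema):
        schema[self.table] = schema[self.table] + [self.column]

schema_v1 = {"page": ["id", "title", "text"]}
smos = [RenameColumn("page", "text", "content"), AddColumn("page", "lang")]

# Forward migration: replay the operators to obtain the new schema version.
schema = {t: list(cols) for t, cols in schema_v1.items()}
for smo in smos:
    smo.apply(schema)
print(schema)   # {'page': ['id', 'title', 'content', 'lang']}

# Legacy-query support goes in the inverse direction: a query written against
# v1's "text" column must be mapped to the new "content" column (naive rewrite).
legacy_query = "SELECT text FROM page"
print(legacy_query.replace("text", "content"))
```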


International Conference on Management of Data | 2011

Workload-aware database monitoring and consolidation

Carlo Curino; Evan Philip Charles Jones; Samuel Madden; Hari Balakrishnan

In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation. Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12× higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.
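
Kairos formulates consolidation as a non-linear optimization program; the sketch below captures only the bin-packing flavor of the problem with a greedy first-fit-decreasing heuristic over made-up resource profiles, as a way to visualize the consolidation ratio, and is not the system's actual solver.

```python
# Hypothetical resource profiles: (cpu, ram, disk_io) demand per database
# workload, expressed as a fraction of one server's capacity.
workloads = {
    "wiki":  (0.30, 0.40, 0.20),
    "forum": (0.10, 0.20, 0.05),
    "crm":   (0.50, 0.30, 0.40),
    "erp":   (0.25, 0.35, 0.30),
    "blog":  (0.05, 0.10, 0.05),
}

CAPACITY = (1.0, 1.0, 1.0)   # normalized server capacity per dimension

def fits(load, demand):
    return all(l + d <= c for l, d, c in zip(load, demand, CAPACITY))

def add(load, demand):
    return tuple(l + d for l, d in zip(load, demand))

# First-fit-decreasing on the largest resource dimension: a crude stand-in
# for the non-linear program, just to show the consolidation objective.
servers = []   # each entry: [current load tuple, list of assigned workloads]
for name, demand in sorted(workloads.items(), key=lambda kv: -max(kv[1])):
    for server in servers:
        if fits(server[0], demand):
            server[0] = add(server[0], demand)
            server[1].append(name)
            break
    else:
        servers.append([demand, [name]])

for load, names in servers:
    print(names, "load:", tuple(round(x, 2) for x in load))
print("consolidation ratio:", f"{len(workloads)}:{len(servers)}")
```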


IEEE International Conference on Pervasive Computing and Communications | 2005

TinyLIME: bridging mobile and sensor networks through middleware

Carlo Curino; Matteo Giani; Marco Giorgetta; Alessandro Giusti; Amy L. Murphy; Gian Pietro Picco

In the rapidly developing field of sensor networks, bridging the gap between the applications and the hardware presents a major challenge. Although middleware is one solution, it must be specialized to the qualities of sensor networks, especially energy consumption. The work presented here provides two contributions: a new operational setting for sensor networks and a middleware for easing software development in this setting. The operational setting we target removes the usual assumption of a central collection point for sensor data. Instead, the sensors are sparsely distributed in an environment, not necessarily able to communicate among themselves, and a set of clients move through space accessing the data of nearby sensors, yielding a system which naturally provides context-relevant information to client applications. We further assume the clients are wirelessly networked and share locally accessed data. This scenario is relevant, for example, when relief workers access the information in their zone and share this information with other workers. Our second contribution, the middleware itself, is an extension of LIME, our earlier work on middleware for mobile ad hoc networks. The model makes sensor data available through a tuple space interface, providing the illusion of shared memory between applications and sensors. This paper presents both the model and the implementation of our middleware incorporated with the Crossbow Mote sensor platform.
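
The tuple space interface can be pictured with a toy implementation like the one below, where None acts as a wildcard in read templates; this is an illustrative sketch, not the LIME/TinyLIME API, and the sensor tuples are invented.

```python
class TupleSpace:
    """Toy tuple space: out() publishes a tuple, rd() reads the first match.
    A template matches when every non-None field equals the tuple's field."""
    def __init__(self):
        self._tuples = []

    def out(self, tup):
        self._tuples.append(tup)

    def rd(self, template):
        for tup in self._tuples:
            if len(tup) == len(template) and all(
                t is None or t == v for t, v in zip(template, tup)
            ):
                return tup
        return None

# Sensors publish (sensor_id, quantity, value) tuples; a mobile client reads
# by pattern instead of addressing an individual mote directly.
space = TupleSpace()
space.out(("mote-3", "temperature", 21.5))
space.out(("mote-7", "light", 430))

print(space.rd((None, "temperature", None)))   # ('mote-3', 'temperature', 21.5)
```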


Communications of the ACM | 2009

And what can context do for data?

Carlo Curino; Giorgio Orsi; Elisa Quintarelli; Rosalba Rossato; Fabio A. Schreiber; Letizia Tanca

Common to all actors in today's information world is the problem of lowering the "information noise": reducing the amount of data to be stored and accessed, and improving the precision with which the available data fit the application requirements. Fitting data to the application needs is tantamount to fitting a dress to a person, and will be referred to as data tailoring; the context is our scissors to tailor data, possibly assembled and integrated from many data sources.

Since the 1980s, many organizations have evolved to meet market needs in terms of flexibility, effective customer relationship management, supply chain optimization, and so on: it has frequently been observed that a set of partners re-engineer their individual organizations into a single, extended enterprise. Their information systems have evolved with them, embracing new technologies such as XML and ontologies, used in ERP systems and Web-service-based applications. In recent years many organizations have also introduced Knowledge Management features into their information systems to allow easy information sharing among their members; these new information sources and their content have to be managed together with other (we might say legacy) enterprise data. This growth of information, if not properly controlled, leads to a data overload that may cause confusion rather than knowledge and dramatically reduce the benefits of a rich information system. However, distinguishing useful information from noise, i.e., from all the information not relevant to the specific application, is not a trivial task; the same piece of information can be regarded differently, even by the same user, in different situations or places: in a single word, in a different context.

The notion of context, which first emerged in fields of research such as psychology and philosophy, is acquiring great importance in computer science as well. In a commonsense interpretation, the context is a set of variables that may be of interest to an agent and that influence its actions. Context often has a significant impact on the way humans (or machines) interpret their environment: a change in context causes a transformation in the actor's mental representation of reality, even when reality itself has not changed. The word, derived from the Latin cum (with, or together) and texere (to weave), describes context not just as a profile, but as an active process dealing with the way humans weave their experience within their whole environment to give it meaning.

In the last few years, sophisticated and general context models have been proposed to support context-aware applications. Among the different meanings attributed to the word context:

- Presentation-oriented: context is perceived as the capability of the system to adapt content presentation to different channels or devices. These context models are often rigid, since they are designed for specific applications and rely on a well-known set of presentation variables.
- Location-oriented: with this family of context models, it is possible to handle ...


Pervasive and Mobile Computing | 2005

Mobile data collection in sensor networks: The TinyLime middleware

Carlo Curino; Matteo Giani; Marco Giorgetta; Alessandro Giusti; Amy L. Murphy; Gian Pietro Picco

In this paper we describe TinyLime, a novel middleware for wireless sensor networks that departs from the traditional setting where sensor data is collected by a central monitoring station, and enables instead multiple mobile monitoring stations to access the sensors in their proximity and share the collected data through wireless links. This intrinsically context-aware setting is demanded by applications where the sensors are sparse and possibly isolated, and where on-site, location-dependent data collection is required. An extension of the Lime middleware for mobile ad hoc networks, TinyLime makes sensor data available through a tuple space interface, providing the illusion of shared memory between applications and sensors. Data aggregation capabilities and a power-savvy architecture complete the middleware features. The paper presents the model and application programming interface of TinyLime, together with its implementation for the Crossbow MICA2 sensor platform.
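
Complementing the tuple-space sketch above, the middleware's data aggregation capability can be pictured as an aggregate read over all matching sensor tuples; the function name rdg and the readings below are hypothetical, not TinyLime's actual interface.

```python
# Hypothetical aggregation over sensor tuples of the form (sensor_id, quantity, value).
readings = [
    ("mote-3", "temperature", 21.5),
    ("mote-7", "temperature", 19.0),
    ("mote-9", "light", 430),
]

def rdg(tuples, quantity, op):
    """Aggregate the values of all tuples reporting the given quantity."""
    values = [v for _, q, v in tuples if q == quantity]
    return op(values) if values else None

print(rdg(readings, "temperature", max))                            # 21.5
print(rdg(readings, "temperature", lambda vs: sum(vs) / len(vs)))   # 20.25
```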


Very Large Data Bases | 2008

Managing and querying transaction-time databases under schema evolution

Hyun Jin Moon; Carlo Curino; Alin Deutsch; Chien-Yi Hou; Carlo Zaniolo

The old problem of managing the history of database information is now made more urgent and complex by fast-spreading web information systems such as Wikipedia. Our PRIMA system addresses this difficult problem by introducing two key pieces of new technology. The first is a method for publishing the history of a relational database in XML, whereby the evolution of the schema and its underlying database are given a unified representation. This temporally grouped representation makes it easy to formulate sophisticated historical queries on any given schema version using standard XQuery. The second key piece of technology is that schema evolution is transparent to the user: she writes queries against the current schema while retrieving the data from one or more schema versions. The system then performs the labor-intensive and error-prone task of rewriting such queries into equivalent ones for the appropriate versions of the schema. This feature is particularly important for historical queries spanning potentially hundreds of different schema versions, and it is realized in PRIMA by (i) introducing Schema Modification Operators (SMOs) to represent the mappings between successive schema versions and (ii) an XML integrity constraint language (XIC) to efficiently rewrite the queries using the constraints established by the SMOs. The scalability of the approach has been tested against both synthetic data and real-world data from the Wikipedia DB schema evolution history.
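
A small sketch of the temporally grouped XML representation, built with the standard library's xml.etree.ElementTree; the element names and the sample attribute history are invented for illustration and do not reproduce PRIMA's actual document layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical history of a single tuple's "title" attribute across time,
# as (value, valid_from, valid_to) triples.
title_history = [
    ("Schema Evolution", "2005-01-01", "2006-06-30"),
    ("Graceful Schema Evolution", "2006-07-01", "9999-12-31"),
]

# Temporally grouped representation: one element per attribute, with one
# timestamped child per version of its value.
row = ET.Element("row", id="42")
title = ET.SubElement(row, "title")
for value, begin, end in title_history:
    v = ET.SubElement(title, "value", begin=begin, end=end)
    v.text = value

print(ET.tostring(row, encoding="unicode"))
# An XQuery over such a document can ask for the title valid on a given day
# without knowing which schema version stored it.
```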


International Conference on Management of Data | 2015

Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications

Bikas Saha; Hitesh Shah; Siddharth Seth; Gopal Vijayaraghavan; Arun C. Murthy; Carlo Curino

The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines that are building upon the YARN resource management layer. The open-source implementation of MapReduce is being slowly replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated efforts, with each new vertical engine re-implementing fundamental features (e.g. fault-tolerance, security, straggler mitigation, etc.) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes. Tez provides a scaffolding and library components that can be used to quickly build scalable and efficient data-flow centric engines. Central to our design is fostering component re-use, without hindering customizability of the performance-critical data plane. This is in fact the key differentiator with respect to the previous generation of systems (e.g. Dryad, MapReduce) and even emerging ones (e.g. Spark), which provided and mandated a fixed data plane implementation. Furthermore, Tez provides native support to build runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementation on popular benchmarks (TPC-DS, TPC-H) and production workloads.
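
Tez models a job as a DAG of processing vertices connected by data-movement edges; the schematic Python below mimics that structure with a topological executor and a toy word-count pipeline. It is a stand-in for the concepts, not Tez's actual Java API.

```python
from collections import defaultdict, deque

class DAG:
    """Schematic data-flow DAG: each vertex holds a processing function and
    edges deliver each upstream vertex's output to its downstream consumers."""
    def __init__(self):
        self.process = {}                      # vertex name -> fn(list of inputs) -> output
        self.downstream = defaultdict(list)
        self.indegree = defaultdict(int)

    def add_vertex(self, name, fn):
        self.process[name] = fn
        self.indegree.setdefault(name, 0)

    def add_edge(self, src, dst):
        self.downstream[src].append(dst)
        self.indegree[dst] += 1

    def run(self):
        """Execute vertices in topological order, threading outputs along edges."""
        remaining = dict(self.indegree)
        inputs, outputs = defaultdict(list), {}
        ready = deque(v for v, d in remaining.items() if d == 0)
        while ready:
            v = ready.popleft()
            outputs[v] = self.process[v](inputs[v])
            for w in self.downstream[v]:
                inputs[w].append(outputs[v])
                remaining[w] -= 1
                if remaining[w] == 0:
                    ready.append(w)
        return outputs

def merge_counts(partials):
    """Reduce step: merge partial word counts from the upstream map vertices."""
    merged = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            merged[word] += n
    return dict(merged)

# A toy word-count pipeline: two map vertices feeding one reduce vertex.
dag = DAG()
dag.add_vertex("map1", lambda _: {"tez": 1, "yarn": 1})
dag.add_vertex("map2", lambda _: {"tez": 2})
dag.add_vertex("reduce", merge_counts)
dag.add_edge("map1", "reduce")
dag.add_edge("map2", "reduce")
print(dag.run()["reduce"])                    # {'tez': 3, 'yarn': 1}
```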

Collaboration


Dive into Carlo Curino's collaborations.

Top Co-Authors

Carlo Zaniolo, University of California
Hyun Jin Moon, University of California
Samuel Madden, Massachusetts Institute of Technology
Andrew Pavlo, Carnegie Mellon University
Evan Philip Charles Jones, Massachusetts Institute of Technology