Publication


Featured research published by D. Agrawal.


symposium on principles of database systems | 1997

Epidemic algorithms in replicated databases (extended abstract)

D. Agrawal; A. El Abbadi; R. Steinke

We present a family of epidemic algorithms for maintaining replicated data in a transactional framework. The algorithms are based on the causal delivery of log records, where each record corresponds to one transaction instead of one operation. The first algorithm in this family is a pessimistic protocol that ensures serializability and guarantees strict executions. Since we expect the epidemic algorithms to be used in environments with a low probability of conflicts among transactions, we develop a variant of the pessimistic algorithm in which locks are released as soon as transactions finish their execution locally. However, this optimistic releasing of locks introduces the possibility of cascading aborts while still ensuring serializable executions. The last member of this family of epidemic algorithms is motivated by the need for asynchronous replication solutions that are being increasingly used in commercial systems. The protocol is optimistic in that transactions commit as soon as they terminate locally, and inconsistencies are detected asynchronously as the effects of committed transactions propagate through the system.

With the proliferation of computer networks, PCs, and workstations, new models for workplaces are emerging. In particular, organizations need to provide ready access to corporate information to users who may or may not always be connected to the database. One way to provide access to such data is through replication. However, traditional synchronous solutions for managing replicated data [Sto79, Tho79, Gif79] cannot be used in such a distributed, mobile, and disconnected environment. As the need for replication grows, several vendors have adopted asynchronous solutions for managing replicated data [BH+88, Ora]. For example, Lotus Notes uses value-based replication in which updates are performed locally and a propagation mechanism is provided to apply these updates to other replica sites. In addition, a version number is used to detect inconsistencies. Resolution of inconsistencies is left to the users. Although the Lotus approach works reasonably well for single-object updates (i.e., environments such as file systems), it fails when multiple objects are involved in a single update (i.e., transaction-oriented environments). In particular, more formal mechanisms are needed for update propagation and conflict detection in the context of asynchronous replication.
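The abstract describes propagating one log record per committed transaction with causal delivery. As a rough illustration of that style of epidemic replication, the sketch below ships per-transaction log records between replicas and applies them in causal (version-vector) order. It is a minimal sketch under assumptions of this summary, not the protocol from the paper, and every name in it (Replica, LogRecord, anti_entropy_push) is hypothetical; conflict detection and aborts are omitted.

```python
# Minimal illustrative sketch (not the paper's protocol): epidemic exchange of
# per-transaction log records, applied in causal order via version vectors.
from dataclasses import dataclass

@dataclass
class LogRecord:
    origin: str      # replica where the transaction committed
    seq: int         # per-origin sequence number
    writes: dict     # item -> value written by the transaction
    deps: dict       # version vector observed at commit time

class Replica:
    def __init__(self, name, all_replicas):
        self.name = name
        self.vv = {r: 0 for r in all_replicas}   # records applied, per origin
        self.log = []                            # records known locally
        self.pending = []                        # received but not causally ready
        self.store = {}

    def commit_local(self, writes):
        """Commit a local transaction and append one log record for it."""
        rec = LogRecord(self.name, self.vv[self.name] + 1, dict(writes), dict(self.vv))
        self._apply(rec)
        return rec

    def anti_entropy_push(self, other):
        """Epidemic step: ship records the other replica has not applied yet."""
        for rec in self.log:
            if rec.seq > other.vv.get(rec.origin, 0):
                other.receive(rec)

    def receive(self, rec):
        self.pending.append(rec)
        self._drain()

    def _causally_ready(self, rec):
        if rec.seq != self.vv[rec.origin] + 1:
            return False
        return all(self.vv.get(o, 0) >= s for o, s in rec.deps.items() if o != rec.origin)

    def _drain(self):
        applied = True
        while applied:
            applied = False
            for rec in list(self.pending):
                if self._causally_ready(rec):
                    self.pending.remove(rec)
                    self._apply(rec)
                    applied = True

    def _apply(self, rec):
        self.store.update(rec.writes)
        self.vv[rec.origin] = rec.seq
        self.log.append(rec)

# Two replicas converging through an epidemic push.
a, b = Replica("A", ["A", "B"]), Replica("B", ["A", "B"])
a.commit_local({"x": 1})
a.anti_entropy_push(b)
print(b.store)   # {'x': 1}
```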


Information Processing Letters | 1990

Exploiting logical structures in replicated databases

D. Agrawal; A. El Abbadi

We propose replica control protocols in which the synchronization cost is reduced by exploiting structural information about the underlying system. We first show that distributed mutual exclusion can be achieved efficiently if a logical structure is imposed on the underlying network. We then argue that solutions to the problem of mutual exclusion can be extended to solve the problem of replica control in a distributed database.


international conference on data engineering | 2002

Multiple query optimization by cache-aware middleware using query teamwork

Kevin O'Gorman; D. Agrawal; A. El Abbadi

Queries with common sequences of disk accesses can make maximal use of a buffer pool. We developed middleware to promote the necessary conditions in concurrent query streams, and achieved a speedup of 2.99 when executing a workload derived from the TPC-H benchmark.


The Computer Journal | 1990

Integrating security with fault-tolerant distributed databases

D. Agrawal; A. El Abbadi

We address the issue of maintaining security in a fault-tolerant replicated database. We present a protocol that combines both security and reliability aspects in a database system. Although this protocol provides the desired level of security, it does so at the expense of availability. By integrating a propagation mechanism with our protocol, we are able to achieve a high level of both security and availability.


extending database technology | 2016

Road to freedom in big data analytics

D. Agrawal; Sanjay Chawla; Ahmed K. Elmagarmid; Zoi Kaoudi; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Mohammed Javeed Zaki

The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms. We present RHEEM, our vision for big data analytics over diverse data processing platforms. RHEEM provides a three-layer data processing and storage abstraction to achieve both platform independence and interoperability across multiple platforms. In this paper, we discuss our vision as well as present multiple research challenges that we need to address to achieve it. As a case in point, we present a data cleaning application built using some of the ideas of RHEEM. We show how it achieves platform independence and the performance benefits of following such an approach.

1. WHY TIED TO ONE SINGLE SYSTEM?

Data analytic tasks may range from very simple to extremely complex pipelines, such as data extraction, transformation, and loading (ETL), online analytical processing (OLAP), graph processing, and machine learning (ML). Following the dictum "one size does not fit all" [23], academia and industry have embarked on an endless race to develop data processing platforms for supporting these different tasks, e.g., DBMSs and MapReduce-like systems. Semantic completeness, high performance, and scalability are key objectives of such platforms. While there have been major achievements in these objectives, users still face two main roadblocks.

The first roadblock is that applications are tied to a single processing platform, making the migration of an application to new and more efficient platforms a difficult and costly task. Furthermore, complex analytic tasks usually require the combined use of different processing platforms. As a result, the common practice is to develop several specialized analytic applications on top of different platforms. This requires users to manually combine the results to draw a conclusion. In addition, users may need to re-implement existing applications on top of faster processing platforms when these become available. For example, Spark SQL [3] and MLlib [2] are the Spark counterparts of Hive [24] and Mahout [1]. The second roadblock is that datasets are often produced by different sources and hence natively reside on different storage platforms. As a result, users often perform tedious, time-intensive, and costly data migration and integration tasks for further analysis.

Let us illustrate these roadblocks with an Oil & Gas industry example [13]. A single oil company can produce more than 1.5TB of diverse data per day [6]. Such data may be structured or unstructured and come from heterogeneous sources, such as sensors, GPS devices, and other measuring instruments. For instance, during the exploration phase, data has to be acquired, integrated, and analyzed in order to predict if a reservoir would be profitable. Thousands of downhole sensors in exploratory wells produce real-time seismic data for monitoring resources and environmental conditions. Users integrate these data with the physical properties of the rocks to visualize volume and surface renderings. From these visualizations, geologists and geophysicists formulate hypotheses and verify them with ML methods, such as regression and classification. Training of the models is performed with historical drilling and production data, but oftentimes users have to go over unstructured data, such as notes exchanged by emails or text from drilling reports filed in a cabinet. Thus, an application supporting such a complex analytic pipeline has to access several sources for historical data (relational, but also text and semi-structured), remove the noise from the streaming data coming from the sensors, and run both traditional (such as SQL) and statistical analytics (such as ML algorithms) over different processing platforms. Similar examples can be drawn from many other domains, such as healthcare: e.g., IBM reported that North York hospital needs to process 50 diverse datasets, which are on a dozen different internal systems [15]. These emerging applications clearly show the need for complex analytics coupled with a diversity of processing platforms, which raises two major research challenges.

Data Processing Challenge. Users are faced with various choices of where to process their data, each choice with possibly orders-of-magnitude differences in terms of performance. However, users have to be intimate with the intricacies of the processing platform to achieve high efficiency and scalability. Moreover, once a decision is taken, users may end up being tied to a particular platform. As a result, migrating the data analytics stack to a more efficient processing platform often becomes a nightmare. Thus, there is a need to build a system that offers data processing platform independence. Furthermore, complex analytic applications require executing tasks over different processing platforms to achieve high performance. For example, one may aggregate large datasets with traditional queries on top of a relational database such as PostgreSQL, but ML tasks might be much faster if executed on Spark [28]. However, this requires a considerable amount of manual work in selecting the best processing platforms, optimizing tasks for the chosen platforms, and coordinating task execution. Thus, this also calls for multi-platform task execution.

Data Storage Challenge. Data processing platforms are typically tightly coupled with a specific storage solution. Moving data from a certain storage (e.g., a relational DB) to a more suitable processing platform for the actual task (e.g., Spark on HDFS) requires shuffling data between different systems. Such shuffling may end up dominating the execution time. Moreover, different departments in the same organization may go for different storage engines due to legacy as well as performance reasons. Dealing with such heterogeneity calls for data storage independence.

To tackle these two challenges, we envision a system, called RHEEM, that provides both platform independence and interoperability (Section 2). In the following, we first discuss our vision for the data processing abstraction (Section 3), which is fully based on user-defined functions (UDFs) to provide adaptability as well as extensibility. This processing abstraction allows users to focus only on the logic of their data analytic tasks and allows applications to be independent from the data processing platforms. We then discuss how to divide a complex analytic task into smaller subtasks to exploit the availability of different processing platforms (Section 4). As a result, RHEEM can run a single data analytic task simultaneously over multiple processing platforms to boost performance. Next, we present our first attempt to build an instance application based on some of the ideas of RHEEM and the resulting benefits (Section 5). We then show how we push the processing abstraction idea down to the storage layer (Section 6). This storage abstraction allows users to focus on their storage needs and allows the processing platforms to be independent from the storage engines. Some initial efforts are also going in the direction of providing data processing platform independence [11, 12, 21] (Section 7). However, our vision goes beyond data processing. We not only envision a data processing abstraction but also a data storage abstraction, allowing us to consider data movement costs during task optimization. We give a research agenda highlighting the challenges that need to be tackled to build RHEEM in Section 8.
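As a rough illustration of the UDF-based, platform-independent processing abstraction argued for above, the sketch below builds a tiny logical plan of map/filter operators that interchangeable backends can execute. The classes and functions (LogicalPlan, LocalBackend, and so on) are hypothetical stand-ins and are not RHEEM's actual API.

```python
# Illustrative sketch of a UDF-based, platform-independent plan abstraction.
# Hypothetical names; not the RHEEM API.
from typing import Callable, Iterable, List

class Operator:
    pass

class Map(Operator):
    def __init__(self, udf: Callable):
        self.udf = udf

class Filter(Operator):
    def __init__(self, udf: Callable):
        self.udf = udf

class LogicalPlan:
    """A platform-agnostic pipeline of operators carrying user-defined functions."""
    def __init__(self, operators: List[Operator]):
        self.operators = operators

class LocalBackend:
    """Executes a plan with plain Python iterators (stand-in for one platform)."""
    def execute(self, plan: LogicalPlan, data: Iterable):
        out = iter(data)
        for op in plan.operators:
            if isinstance(op, Map):
                out = map(op.udf, out)
            elif isinstance(op, Filter):
                out = filter(op.udf, out)
        return list(out)

# A second backend (e.g., one translating the same plan to Spark or a DBMS)
# would expose the same execute() interface, so the application never changes.
plan = LogicalPlan([Filter(lambda x: x % 2 == 0), Map(lambda x: x * 10)])
print(LocalBackend().execute(plan, range(6)))   # [0, 20, 40]
```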


international conference on management of data | 2016

Rheem: Enabling Multi-Platform Task Execution

D. Agrawal; Lamine Ba; Laure Berti-Equille; Sanjay Chawla; Ahmed K. Elmagarmid; Hossam M. Hammady; Yasser Idris; Zoi Kaoudi; Zuhair Khayyat; Sebastian Kruse; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Mohammed Javeed Zaki

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of Rheem by using real-world scenarios from three different applications, namely, machine learning, data cleaning, and data fusion.


symposium on principles of database systems | 1994

Relative serializability (extended abstract): an approach for relaxing the atomicity of transactions

D. Agrawal; John L. Bruno; A. El Abbadi; V. Krishnaswamy

In the presence of semantic information, serializability is too strong a correctness criterion and unnecessarily restricts concurrency. We use the semantic information of a transaction to provide different atomicity views of the transaction to other transactions. The proposed approach improves concurrency and allows interleavings among transactions which are non-serializable, but which nonetheless preserve the consistency of the database and are acceptable to other users. We develop a graph-based tool whose acyclicity is both a necessary and sufficient condition for the correctness of an execution. Our theory encompasses earlier proposals that incorporate semantic information of transactions. Furthermore, it is the first approach that provides an efficient graph-based tool for recognizing correct schedules without imposing any restrictions on the application domain. Our approach is widely applicable to many advanced database applications, such as systems with long-lived transactions and collaborative environments.
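The correctness test mentioned above reduces to checking the acyclicity of a graph over transactions. The sketch below shows a generic conflict-graph acyclicity check, a standard serializability-style test rather than the paper's relative serialization graph (which additionally encodes the per-transaction atomicity views); all names are hypothetical.

```python
# Minimal sketch: build a conflict graph over transactions and test acyclicity.
# Generic serializability-style check, not the paper's relative-serialization graph.
from collections import defaultdict

def conflict_edges(schedule):
    """schedule: list of (txn, op, item). Conflicting operations on the same
    item from different transactions induce an edge in schedule order."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and (op1 == "w" or op2 == "w"):
                edges.add((t1, t2))
    return edges

def is_acyclic(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:                 # back edge -> cycle
                return False
            if color[nxt] == WHITE and not dfs(nxt):
                return False
        color[node] = BLACK
        return True

    return all(color[n] == BLACK or dfs(n) for n in list(graph))

# T1 reads x, T2 writes x, T2 writes y, T1 reads y: edges T1->T2 and T2->T1.
s = [("T1", "r", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "r", "y")]
print(is_acyclic(conflict_edges(s)))   # False: the execution is not serializable
```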


international conference on multimedia computing and systems | 1997

Browsing and placement of multiresolution images on secondary storage

Sunil Prabhakar; D. Agrawal; A. El Abbadi; Ambuj K. Singh; T. Smith

Image decomposition techniques such as wavelets are used to provide multiresolution representations of images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. Several strategies are evaluated to store the image coefficients on parallel disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Disk simulation and experiments with real disks are used to evaluate the performance of these strategies. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.
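The placement idea, spreading the coefficients that browsing touches most often across the available disks so thumbnail reads can proceed in parallel, can be illustrated with a toy greedy assignment. This is only an assumed sketch with hypothetical names, not one of the strategies evaluated in the paper.

```python
# Toy sketch: place wavelet coefficients on disks, spreading the most
# frequently browsed ones so thumbnail reads hit many disks in parallel.
# Illustrative only; not one of the paper's evaluated placement strategies.
def place_coefficients(coeffs, num_disks):
    """coeffs: list of (coeff_id, browse_frequency). Returns coeff_id -> disk."""
    load = [0.0] * num_disks
    placement = {}
    # Greedy: hottest coefficient first, always onto the least-loaded disk.
    for cid, freq in sorted(coeffs, key=lambda c: c[1], reverse=True):
        disk = min(range(num_disks), key=load.__getitem__)
        placement[cid] = disk
        load[disk] += freq
    return placement

coeffs = [("thumb", 100), ("band1", 40), ("band2", 40), ("band3", 5)]
print(place_coefficients(coeffs, 2))   # the two hottest coefficients land on different disks
```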


principles of distributed computing | 1997

Using broadcast primitives in replicated databases (abstract)

Ioana Stanoi; D. Agrawal; A. El Abbadi

Recently there has been increasing interest in the development of broadcast protocols for disseminating information in distributed systems. Several protocols with varying properties have been proposed and implemented. One of the commonly cited applications is the management of replicated data. Most prior attempts concentrated on using atomic broadcast for file-like applications where single operations need to be executed on multiple copies of a file. We, on the other hand, explore the use of the simpler variants of broadcast protocols for managing replicated databases, where the unit of activity is a transaction consisting of multiple operations that need to be executed atomically as a unit. We assume a fully replicated database with two-phase locking. Reliable broadcast is a simple communication primitive that is easy to implement and guarantees eventual delivery. We make use of this guarantee to remove the need for explicit acknowledgement after every remote interaction. We adapt the read-one write-all protocol to a reliable-broadcast-based system as follows. A read operation is executed locally by acquiring a read lock. A write operation is executed by reliably broadcasting it to all database sites. On delivery, a site either acquires a write lock or the write operation is delayed until the lock can be granted. The transaction proceeds to execute the next operation without waiting for all write locks to be granted, including at the initiator. When the initiating site decides to commit a transaction, it reliably broadcasts a commit request to all the sites. On delivery, a site checks if the transaction has any pending write operations; in that case the acknowledgment for the commit is negative, otherwise positive. A transaction commits if all acknowledgements are positive, otherwise it aborts. This protocol avoids global deadlocks and read-only transactions are never aborted. We next examined the advantage of using causal broadcast for replicated databases. If we use a causal broadcast primitive instead of a reliable broadcast primitive in the above protocol, the modified protocol remains correct. The use of the different broadcast primitives offers some interesting ...
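The protocol steps above are concrete enough to sketch: every write is reliably broadcast and locked on delivery, and a commit request gets a positive acknowledgment only from sites with no pending writes for that transaction. In the simplified simulation below, synchronous method calls stand in for reliable broadcast delivery; it is a sketch with hypothetical names (Site, execute_transaction), not the paper's implementation, and abort handling is omitted.

```python
# Simplified sketch of the reliable-broadcast-based read-one/write-all protocol
# summarized above. Method calls stand in for reliable broadcast delivery.
class Site:
    def __init__(self, name):
        self.name = name
        self.write_locks = {}     # item -> txn currently holding the write lock
        self.pending = {}         # txn -> items whose write lock is not yet granted

    def deliver_write(self, txn, item):
        """On delivery of a broadcast write: grant the lock or queue the write."""
        if self.write_locks.get(item) in (None, txn):
            self.write_locks[item] = txn
        else:
            self.pending.setdefault(txn, set()).add(item)

    def deliver_commit(self, txn):
        """Positive ack only if this site has no pending writes for txn."""
        ok = not self.pending.get(txn)
        if ok:
            self.write_locks = {i: t for i, t in self.write_locks.items() if t != txn}
        return ok

def execute_transaction(sites, txn, writes):
    for item in writes:                              # each write is reliably broadcast
        for s in sites:
            s.deliver_write(txn, item)
    acks = [s.deliver_commit(txn) for s in sites]    # commit request is broadcast
    return all(acks)                                 # commit iff every ack is positive

sites = [Site("s1"), Site("s2"), Site("s3")]
print(execute_transaction(sites, "T1", ["x", "y"]))  # True: no conflicts, so T1 commits
```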


Distributed and Parallel Databases | 1994

A nonrestrictive concurrency control protocol for object-oriented databases

D. Agrawal; A. El Abbadi

We propose an algorithm for executing transactions in object-oriented databases. The object-oriented database model generalizes the classical model of database concurrency control by permitting accesses to class and instance objects, by permitting arbitrary operations on objects as opposed to traditional read and write operations, and by allowing nested execution of transactions on objects. In this paper, we first develop a uniform methodology for treating both classes and instances. We then develop a two-phase locking protocol with a new relationship between locks, called ordered sharing, for an object-oriented database. Ordered sharing does not restrict the execution of conflicting operations. Finally, we extend the protocol to handle objects that execute methods on other objects, thus resulting in the nested execution of transactions. The resulting protocol permits more concurrency than other known locking-based protocols.
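Ordered sharing, as summarized above, lets conflicting lock requests be granted immediately while an ordering between their holders is recorded, so a transaction can only complete after the transactions it was ordered behind. The sketch below is a simplified reading of that idea with hypothetical names; it is not the paper's protocol and omits operation ordering, nesting, and recovery details.

```python
# Simplified sketch of "ordered sharing": conflicting locks are granted right
# away, an ordering dependency is recorded, and a transaction may only commit
# after the transactions it acquired ordered-shared locks behind.
from collections import defaultdict

class OrderedSharingLockManager:
    def __init__(self):
        self.holders = defaultdict(list)        # item -> [txn, ...] in grant order
        self.wait_for = defaultdict(set)        # txn -> txns that must finish first
        self.finished = set()

    def lock(self, txn, item):
        """Grant the lock immediately; order txn after the current holders."""
        for prior in self.holders[item]:
            if prior != txn:
                self.wait_for[txn].add(prior)   # conflicting op ordered after prior
        if txn not in self.holders[item]:
            self.holders[item].append(txn)

    def can_commit(self, txn):
        """txn may commit only once every predecessor has finished."""
        return all(p in self.finished for p in self.wait_for[txn])

    def commit(self, txn):
        if not self.can_commit(txn):
            return False                        # must wait for predecessors
        self.finished.add(txn)
        for item in self.holders:
            self.holders[item] = [t for t in self.holders[item] if t != txn]
        return True

lm = OrderedSharingLockManager()
lm.lock("T1", "x")
lm.lock("T2", "x")          # conflicting lock granted, but T2 is ordered after T1
print(lm.commit("T2"))      # False: T1 has not finished yet
print(lm.commit("T1"))      # True
print(lm.commit("T2"))      # True
```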

Collaboration


Dive into D. Agrawal's collaborations.

Top Co-Authors

A. El Abbadi, University of California
Nan Tang, Qatar Computing Research Institute
Mohammed Javeed Zaki, Rensselaer Polytechnic Institute
Zoi Kaoudi, Qatar Computing Research Institute
John L. Bruno, University of California
Kevin O'Gorman, University of California