
Publications


Featured research published by Jun Rao.


international conference on management of data | 2010

A comparison of join algorithms for log processing in MapReduce

Spyros Blanas; Jignesh M. Patel; Vuk Ercegovac; Jun Rao; Eugene J. Shekita; Yuanyuan Tian

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
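As an illustration of one of the strategies the paper compares, here is a minimal Python sketch of a reduce-side (repartition) join, with the map and reduce phases simulated in-process. The field names (`user_id`, `url`, `name`) are illustrative, not taken from the paper.

```python
from collections import defaultdict

def repartition_join(log_records, user_records):
    """Reduce-side (repartition) join sketch: records from both inputs
    are tagged with their source and grouped by the join key (the map
    phase), then each key's two groups are cross-joined (the reduce phase)."""
    # "Map" phase: emit (join key, tagged record).
    groups = defaultdict(lambda: {"log": [], "user": []})
    for rec in log_records:
        groups[rec["user_id"]]["log"].append(rec)
    for rec in user_records:
        groups[rec["user_id"]]["user"].append(rec)
    # "Reduce" phase: per-key cross product of the two tagged lists.
    joined = []
    for sides in groups.values():
        for l in sides["log"]:
            for u in sides["user"]:
                joined.append({**l, **u})
    return joined

logs = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
users = [{"user_id": 1, "name": "Ada"}]
print(repartition_join(logs, users))
# only user 1 has reference data, so one joined record comes out
```

In a real MapReduce job the grouping is done by the framework's shuffle; the sketch collapses that into a dictionary.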


international conference on management of data | 2002

Automating physical database design in a parallel database

Jun Rao; Chun Zhang; Nimrod Megiddo; Guy M. Lohman

Physical database design is important for query performance in a shared-nothing parallel database system, in which data is horizontally partitioned among multiple independent nodes. We seek to automate the process of data partitioning. Given a workload of SQL statements, we seek to determine automatically how to partition the base data across multiple nodes to achieve overall optimal (or close to optimal) performance for that workload. Previous attempts use heuristic rules to make those decisions. These approaches fail to consider all of the interdependent aspects of query performance typically modeled by today's sophisticated query optimizers. We present a comprehensive solution to the problem that has been tightly integrated with the optimizer of a commercial shared-nothing parallel database system. Our approach uses the query optimizer itself both to recommend candidate partitions for each table that will benefit each query in the workload, and to evaluate various combinations of these candidates. We compare a rank-based enumeration method with a random-based one. Our experimental results show that the former is more effective.
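The core loop the abstract describes, choosing a partitioning key per table by costing it against a workload, can be sketched in a few lines. This is a toy stand-in, not the paper's method: the real system uses the query optimizer's cost model, whereas the sketch uses an invented rule (an equality predicate on the partitioning key touches one node, anything else touches all nodes). All names are illustrative.

```python
def hash_partition(rows, key, num_nodes):
    """Horizontally partition rows across nodes by hashing the
    chosen partitioning column (a shared-nothing layout)."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def workload_cost(candidate_key, queries, num_nodes):
    """Toy cost model: a query with an equality predicate on the
    partitioning key is routed to one node; any other query must
    be broadcast to all nodes."""
    return sum(1 if candidate_key in q["eq_columns"] else num_nodes
               for q in queries)

# Rank the candidate keys for one table against a tiny workload.
queries = [{"eq_columns": {"cust_id"}},
           {"eq_columns": {"cust_id"}},
           {"eq_columns": {"order_id"}}]
best = min(["cust_id", "order_id", "ship_date"],
           key=lambda c: workload_cost(c, queries, num_nodes=4))
print(best)  # cust_id: cost 1 + 1 + 4 = 6, the cheapest candidate
```

The paper's rank-based enumeration then searches over combinations of such per-table candidates; the sketch stops at a single table.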


international conference on data engineering | 2006

Compiled Query Execution Engine using JVM

Jun Rao; Hamid Pirahesh; C. Mohan; Guy M. Lohman

A conventional query execution engine in a database system essentially uses a SQL virtual machine (SVM) to interpret a dataflow tree in which each node is associated with a relational operator. During query evaluation, a single tuple at a time is processed and passed among the operators. Such a model is popular because of its efficiency for pipelined processing. However, since each operator is implemented statically, it has to be very generic in order to deal with all possible queries. Such generality tends to introduce significant runtime inefficiency, especially in the context of memory-resident systems, because the granularity of data processing (a tuple) is too small compared with the associated overhead. Another disadvantage in such an engine is that each operator code is compiled statically, so query-specific optimization cannot be applied. To improve runtime efficiency, we propose a compiled execution engine, which, for a given query, generates new query-specific code on the fly, and then dynamically compiles and executes the code. The Java platform makes our approach particularly interesting for several reasons: (1) modern Java Virtual Machines (JVM) have Just-In-Time (JIT) compilers that optimize code at runtime based on the execution pattern, a key feature that SVMs lack; (2) because of Java’s continued popularity, JVMs keep improving at a faster pace than SVMs, allowing us to exploit new advances in the Java runtime in the future; (3) Java is a dynamic language, which makes it convenient to load a piece of new code on the fly. In this paper, we develop both an interpreted and a compiled query execution engine in a relational, Java-based, in-memory database prototype, and perform an experimental study. Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in-memory commercial system using SVM.
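The idea of generating query-specific code on the fly can be shown compactly in Python, with `compile`/`exec` standing in for the paper's Java code generation and JIT compilation. This is an analogy, not the paper's implementation; the predicate string and function names are illustrative.

```python
def compile_filter_scan(predicate_src):
    """Generate source code specialized to one query's predicate,
    compile it once, and return the resulting function. A generic
    interpreted operator would instead re-dispatch on the predicate
    for every tuple."""
    src = (
        "def run(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"   # predicate inlined into the code
        "            out.append(row)\n"
        "    return out\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["run"]

# One query -> one specialized scan function.
run = compile_filter_scan("row['price'] > 100")
print(run([{"price": 50}, {"price": 150}]))  # [{'price': 150}]
```

In the paper the payoff is that the JVM's JIT can then optimize the generated code based on its execution pattern, something a statically compiled interpreter cannot offer.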


international conference on management of data | 1998

Reusing invariants: a new strategy for correlated queries

Jun Rao; Kenneth A. Ross

Correlated queries are very common and important in decision support systems. Traditional nested iteration evaluation methods for such queries can be very time consuming. When they apply, query rewriting techniques have been shown to be much more efficient. But query rewriting is not always possible. When query rewriting does not apply, can we do something better than the traditional nested iteration methods? In this paper, we propose a new invariant technique to evaluate correlated queries efficiently. The basic idea is to recognize the part of the subquery that is not related to the outer references and cache the result of that part after its first execution. Later, we can reuse the result and combine it with the result of the rest of the subquery that is changing for each iteration. Our technique applies to arbitrary correlated subqueries. This paper introduces algorithms to recognize the invariant part of a data flow tree, and to restructure the evaluation plan to reuse the stored intermediate result. We also propose an efficient method to teach an existing join optimizer to understand the invariant feature and thus enable it to generate better join plans in the new context. Some other related optimization techniques are also discussed. The proposed techniques were implemented within three months on an existing commercial database system. We also experimentally evaluate our proposed technique. Our evaluation indicates that, when query rewriting is not possible, the invariant technique is significantly better than the traditional nested iteration method. Even when query rewriting applies, the invariant technique is sometimes better than the query rewriting technique. Our conclusion is that the invariant technique should be considered as one of the alternatives in evaluating correlated queries since it fills the gap left by rewriting techniques.
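The basic idea, computing the outer-independent part of a subquery once and reusing it per outer row, can be sketched as follows. The schema (`dept`, `status`) is invented for illustration; the paper works on data flow trees inside a DBMS, not Python lists.

```python
def correlated_exists(outer_rows, inner_rows):
    """Invariant-technique sketch for an EXISTS-style correlated
    subquery: the status filter does not reference the outer query,
    so it is evaluated once and cached; only the correlation
    predicate runs per outer row."""
    # Invariant part: computed once, before the outer loop.
    cached = [r for r in inner_rows if r["status"] == "active"]
    results = []
    for outer in outer_rows:
        # Varying part: correlation predicate against the cached result.
        if any(r["dept"] == outer["dept"] for r in cached):
            results.append(outer)
    return results

emps = [{"dept": "eng"}, {"dept": "hr"}]
projects = [{"dept": "eng", "status": "active"},
            {"dept": "hr", "status": "closed"}]
print(correlated_exists(emps, projects))  # [{'dept': 'eng'}]
```

Naive nested iteration would re-run the status filter for every outer row; with the cache, the expensive part runs exactly once.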


international conference on data engineering | 2001

Using EELs, a practical approach to outerjoin and antijoin reordering

Jun Rao; Bruce G. Lindsay; Guy M. Lohman; Hamid Pirahesh; David E. Simmen

Outerjoins and antijoins are two important classes of joins in database systems. Reordering outerjoins and antijoins with innerjoins is challenging because not all the join orders preserve the semantics of the original query. Previous work did not consider antijoins and was restricted to a limited class of queries. We consider using a conventional bottom-up optimizer to reorder different types of joins. We propose extending each join predicate's eligibility list, which contains all the tables referenced in the predicate. An extended eligibility list (EEL) includes all the tables needed by a predicate to preserve the semantics of the original query. We describe an algorithm that can set up the EELs properly in a bottom-up traversal of the original operator tree. A conventional join optimizer is then modified to check the EELs when generating sub-plans. Our approach handles antijoins and can resolve many practical issues. It is now being implemented in an upcoming release of IBM's Universal Database Server for Unix, Windows and OS/2.
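The EEL check itself is a simple set-containment test, which a sketch can make concrete. The table names and the particular EEL contents below are invented for illustration; the paper's contribution is the algorithm that computes the EELs correctly, which the sketch takes as given.

```python
def can_apply(predicate_eel, tables_in_subplan):
    """A predicate may be applied in a sub-plan only when every
    table in its extended eligibility list (EEL) is already joined
    into that sub-plan; otherwise the reordering could change the
    query's semantics."""
    return predicate_eel <= tables_in_subplan

# For a plain innerjoin predicate, the EEL is just the tables it references.
inner_pred_eel = {"R", "S"}
# A predicate involved with an outerjoin or antijoin may carry extra
# tables in its EEL to preserve the original semantics.
outer_pred_eel = {"S", "T", "R"}

print(can_apply(inner_pred_eel, {"R", "S"}))   # True: safe to apply here
print(can_apply(outer_pred_eel, {"S", "T"}))   # False: R not yet available
```

A bottom-up optimizer would run this test each time it considers attaching a predicate to a candidate sub-plan, pruning orders that break the original query.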


Information Systems | 2001

Adapting materialized views after redefinitions: techniques and a performance study

Ashish Gupta; Inderpal Singh Mumick; Jun Rao; Kenneth A. Ross

We consider a variant of the view maintenance problem: How does one keep a materialized view up-to-date when the view definition itself changes? Can one do better than recomputing the view from the base relations? Traditional view maintenance tries to maintain the materialized view in response to modifications to the base relations; we try to “adapt” the view in response to changes in the view definition. Such techniques are needed for applications where the user can change queries dynamically and wants to see the changes in the results fast. Data archaeology, data visualization, and dynamic queries are examples of such applications. Views defined over the Internet tend to evolve and our technique can be useful for adapting such views. We consider all possible redefinitions of SQL SELECT-FROM-WHERE-GROUPBY-HAVING, UNION, and EXCEPT views, and show how these views can be adapted using the old materialization for the cases where it is possible to do so. We identify extra information that can be kept with a materialization to facilitate redefinition. Multiple simultaneous changes to a view can be handled without necessarily materializing intermediate results. We identify guidelines for users and database administrators that can be used to facilitate efficient view adaptation. We perform a systematic experimental evaluation of our proposed techniques. Our evaluation indicates that adaptation is much more efficient than rematerialization in most cases. In-place adaptation methods are better than the non-in-place methods when the change is small. We also point out some important factors that can impact the efficiency of adaptation.
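The simplest adaptation case, a redefinition that adds a conjunct to the WHERE clause, can be sketched directly: the new view is a subset of the old one, so filtering the old materialization beats recomputing from the base relations. Column names are invented for illustration, and real adaptation covers many more redefinition cases than this one.

```python
def adapt_view(old_rows, extra_pred):
    """Adaptation sketch: when the redefined view adds a predicate
    to the WHERE clause, adapt the old materialization by filtering
    it, instead of re-running the view query over the base tables."""
    return [r for r in old_rows if extra_pred(r)]

# Old view: all sales rows. New view: the same query AND sales > 50.
old_view = [{"region": "EU", "sales": 10},
            {"region": "US", "sales": 99}]
new_view = adapt_view(old_view, lambda r: r["sales"] > 50)
print(new_view)  # [{'region': 'US', 'sales': 99}]
```

The harder direction, a predicate that loosens, is where the paper's "extra information kept with a materialization" comes in, since the old materialization alone no longer contains every needed row.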


cloud data management | 2009

Leveraging a scalable row store to build a distributed text index

Ning Li; Jun Rao; Eugene J. Shekita; Sandeep Tata

Many content-oriented applications require a scalable text index. Building such an index is challenging. In addition to the logic of inserting and searching documents, developers have to worry about issues in a typical distributed environment, such as fault tolerance, incrementally growing the index cluster, and load balancing. We developed a distributed text index called HIndex, by judiciously exploiting the control layer of HBase, which is an open source implementation of Google's Bigtable. Such leverage enables us to inherit the support on availability, elasticity and load balancing in HBase. We present the design, implementation, and a performance evaluation of HIndex in this paper.
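The structural idea, layering an inverted index on a sorted row store so that search becomes a key-range scan, can be sketched in miniature. This is an assumed simplification for illustration, not HIndex's actual schema: a Python dict stands in for the row store that HBase provides, and the class and key layout are invented.

```python
class RowStoreTextIndex:
    """Text-index-on-a-row-store sketch: each posting is stored as a
    row keyed by (term, doc_id). Searching a term is then a prefix
    scan over that term's key range, which a sorted row store such
    as HBase supports natively."""
    def __init__(self):
        self.rows = {}  # row key -> value; stands in for the row store

    def insert(self, doc_id, text):
        # One row per distinct term in the document.
        for term in set(text.lower().split()):
            self.rows[(term, doc_id)] = True

    def search(self, term):
        # Prefix scan: all rows whose key starts with the term.
        return sorted(doc for (t, doc) in self.rows if t == term)

idx = RowStoreTextIndex()
idx.insert(1, "scalable text index")
idx.insert(2, "distributed text search")
print(idx.search("text"))  # [1, 2]
```

Pushing the storage into HBase is what lets the real system inherit fault tolerance, elasticity, and load balancing for free.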


Archive | 2000

Power-Pipelining for Enhanced Query Performance

Jun Rao; Kenneth A. Ross

As random access memory gets cheaper, it becomes increasingly affordable to build computers with large main memories. In this paper, we consider processing queries within the context of a main memory database system and try to enhance the query execution engine of such a system. An execution plan is usually represented as an operator tree. Traditional query execution engines evaluate a query by recursively iterating each operator and returning exactly one tuple result for each iteration. This generates a large number of function calls. In a main-memory database system, the cost of a function call is relatively high. We propose a new evaluation method called power-pipelining. Each operator processes and passes many tuples instead of one tuple each time. We keep the number of matches generated in a join in a local counter array. We reconstitute the run-length encoding of the join result (broken down by input tables) at the end of the join processing or when necessary. We describe an overflow handling mechanism that allows us to always evaluate a query without using too much space. Besides the benefit of reducing the number of function calls, power-pipelining compresses the join result automatically and thus can reduce the transmission cost. However, power-pipelining may not always outperform the traditional pipelining method when the join selectivity is low. To get the benefit of both techniques, we propose to use them together in an execution engine and let the optimizer choose the preferred execution. We discuss how to incorporate this in a cost-based query optimizer. We implemented power-pipelining and the reconstitution algorithm in the Columbia main memory database system. We present a performance study of power-pipelining and the traditional pipelining method. We show that the improvement of using power-pipelining can be very significant. 
As a result, we believe that power-pipelining is a useful and feasible technique and should be incorporated into main memory database query execution engines.
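The contrast the abstract draws, one function call per tuple versus one call per batch of tuples, can be sketched with batch-at-a-time operators. This illustrates only the call-amortization idea; the paper's run-length encoding of join results and overflow handling are not modeled, and all names are illustrative.

```python
def batch_scan(rows, batch_size):
    """Batch-at-a-time scan: each call to the iterator yields many
    tuples, amortizing per-call overhead across the whole batch
    (the core idea behind power-pipelining)."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def batch_filter(batches, pred):
    """Downstream operator that also consumes and produces batches,
    so the whole pipeline pays one call per batch, not per tuple."""
    for batch in batches:
        out = [r for r in batch if pred(r)]
        if out:
            yield out

rows = list(range(10))
result = [r
          for batch in batch_filter(batch_scan(rows, 4), lambda r: r % 2 == 0)
          for r in batch]
print(result)  # [0, 2, 4, 6, 8]
```

Tuple-at-a-time pipelining would make one call into each operator per row, ten scan calls and ten filter calls here, where the batched version makes three of each.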


very large data bases | 2004

DB2 design advisor: integrated automatic physical database design

Daniel C. Zilio; Jun Rao; Sam Lightstone; Guy M. Lohman; Adam J. Storm; Christian Marcelo Garcia-Arellano; Scott Fadden


very large data bases | 1999

Cache Conscious Indexing for Decision-Support in Main Memory

Jun Rao; Kenneth A. Ross
