Publication


Featured research published by Ming-Chuan Wu.


international conference on data engineering | 1998

Encoded bitmap indexing for data warehouses

Ming-Chuan Wu; Alejandro P. Buchmann

Complex query types, huge data volumes, and very high read/update ratios make the indexing techniques designed and tuned for traditional database systems unsuitable for data warehouses (DW). We propose an encoded bitmap indexing scheme for DWs which improves the performance of known bitmap indexing techniques on large-cardinality domains. We present a performance analysis and theorems that identify the properties of good encodings. We compare encoded bitmap indexing with related techniques, such as bit slicing and projection-, dynamic-, and range-based indexing.
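The core idea, keeping one bitmap per bit of a value's code instead of one bitmap per distinct value, can be sketched in a few lines. The following is a minimal illustration, not the paper's algorithm; the encoding used here is just an arbitrary binary numbering of the distinct values, whereas the paper studies which encodings perform well.

```python
# Minimal sketch of an encoded bitmap index over one column, using Python ints
# as bitmaps. A domain of d distinct values needs ceil(log2 d) bitmaps instead
# of d; an equality lookup becomes a Boolean expression over those bitmaps.
from math import ceil, log2

class EncodedBitmapIndex:
    def __init__(self, column):
        values = sorted(set(column))
        self.code = {v: i for i, v in enumerate(values)}       # assumed encoding: value -> small int
        self.nbits = max(1, ceil(log2(len(values))))
        self.nrows = len(column)
        self.bitmaps = [0] * self.nbits                         # bitmap j holds bit j of each row's code
        for row, v in enumerate(column):
            c = self.code[v]
            for j in range(self.nbits):
                if (c >> j) & 1:
                    self.bitmaps[j] |= 1 << row

    def lookup(self, value):
        """Rows where column == value, returned as a bitmap (Python int)."""
        c = self.code[value]
        all_rows = (1 << self.nrows) - 1
        result = all_rows
        for j in range(self.nbits):
            b = self.bitmaps[j]
            result &= b if (c >> j) & 1 else (all_rows & ~b)    # AND the bitmap or its complement
        return result

# Usage: a column with 3 distinct values is indexed with only 2 bitmaps.
idx = EncodedBitmapIndex(["DE", "US", "TW", "US", "DE"])
print(bin(idx.lookup("US")))   # 0b1010: rows 1 and 3
```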


very large data bases | 2012

SCOPE: parallel databases meet MapReduce

Jingren Zhou; Nicolas Bruno; Ming-Chuan Wu; Per-Ake Larson; Ronnie Chaiken; Darren A. Shakib

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe a distributed computation system, Structured Computations Optimized for Parallel Execution (Scope), targeted for this type of massive data analysis. Scope combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. Scope is being used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
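As a rough illustration of the execution model described above (a toy sketch, not Scope's implementation; all names and structures here are invented), a compiled script can be viewed as a DAG of vertices whose execution is driven by a job manager that runs a vertex once its inputs are available and retries failures:

```python
# Toy model: a physical plan as a DAG of vertices plus a naive job manager.
class Vertex:
    def __init__(self, name, inputs, fn):
        self.name, self.inputs, self.fn = name, inputs, fn

def run_plan(vertices, max_retries=2):
    """Run a DAG of vertices; each vertex consumes the outputs of its input vertices."""
    done, pending = {}, list(vertices)
    while pending:
        progressed = False
        for v in list(pending):
            if all(i in done for i in v.inputs):              # all inputs materialized?
                for attempt in range(max_retries + 1):        # simple retry for fault tolerance
                    try:
                        done[v.name] = v.fn(*[done[i] for i in v.inputs])
                        break
                    except Exception:
                        if attempt == max_retries:
                            raise
                pending.remove(v)
                progressed = True
        if not progressed:
            raise ValueError("cycle or missing input in plan")
    return done

# A script like "SELECT key, COUNT(*) FROM log GROUP BY key" might compile into
# extract -> partial aggregate -> final aggregate vertices.
plan = [
    Vertex("extract", [], lambda: ["a", "b", "a", "c", "a"]),
    Vertex("partial_agg", ["extract"], lambda rows: {k: rows.count(k) for k in set(rows)}),
    Vertex("final_agg", ["partial_agg"], lambda agg: sorted(agg.items())),
]
print(run_plan(plan)["final_agg"])   # [('a', 3), ('b', 1), ('c', 1)]
```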


BTW | 1997

Research Issues in Data Warehousing

Ming-Chuan Wu; Alejandro P. Buchmann

Data warehousing is a booming industry with many interesting research problems, but the database research community has so far concentrated on only a few aspects. In this paper, we summarize the state of the art, suggest architectural extensions, and identify research problems in the areas of warehouse modeling and design, data cleansing and loading, data refreshing and purging, metadata management, extensions to relational operators, alternative implementations of traditional relational operators, special index structures, and query optimization with aggregates.


international conference on management of data | 1999

Query optimization for selections using bitmaps

Ming-Chuan Wu

Bitmaps are popular indexes for data warehouse (DW) applications, and most database management systems offer them today. This paper proposes query optimization strategies for selections using bitmaps, considering both continuous and discrete selection criteria. The strategies are categorized into static and dynamic ones. The static strategies discussed are the optimal design of bitmaps and algorithms based on tree and logical reduction; the dynamic strategy discussed is an inclusion-and-exclusion approach for both bit-sliced indexes and encoded bitmap indexes.
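To make the bit-sliced setting concrete, the sketch below evaluates a range selection over a bit-sliced index, where slice i is a bitmap of the rows whose value has bit i set. It only illustrates the flavor of the techniques discussed, not the paper's specific algorithms, and the helper names are assumptions.

```python
# Hedged sketch: evaluating "value <= c" on a bit-sliced index (bitmaps as Python ints).
def build_slices(column, nbits):
    slices = [0] * nbits
    for row, v in enumerate(column):
        for i in range(nbits):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def rows_leq(slices, c, nrows):
    """Bitmap of rows with value <= c, scanning slices from the most significant bit."""
    all_rows = (1 << nrows) - 1
    lt, eq = 0, all_rows
    for i in reversed(range(len(slices))):
        b = slices[i]
        if (c >> i) & 1:
            lt |= eq & (all_rows & ~b)   # value bit is 0 where c's bit is 1 -> strictly less
            eq &= b
        else:
            eq &= all_rows & ~b          # value bit is 1 where c's bit is 0 -> not <= c
    return lt | eq

column = [5, 2, 7, 1, 4]
slices = build_slices(column, nbits=3)
print(bin(rows_leq(slices, 4, len(column))))   # 0b11010: rows 1, 3, 4 have value <= 4
```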


very large data bases | 2003

Statistics on views

Cesar A. Galindo-Legaria; Milind M. Joshi; Florian Waas; Ming-Chuan Wu

The quality of execution plans generated by a query optimizer is tied to the accuracy of its cardinality estimation. Errors in estimation lead to poor performance, erratic behavior, and user frustration. Traditionally, the optimizer is restricted to using only statistics on base table columns and derives estimates bottom-up. This approach has shortcomings when dealing with complex queries and with rich languages such as SQL: errors grow as estimation is done on top of estimation, and some constructs are simply not handled. In this paper we describe the creation and utilization of statistics on views in SQL Server, which provide the optimizer with statistical information on the result of scalar or relational expressions. This opens a new dimension in the data available for cardinality estimation and enables arbitrary correction. We describe the implementation of this feature in the optimizer architecture and show its impact on the quality of generated plans through a number of examples.
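The following toy example (not SQL Server internals; the table, numbers, and helper names are invented) shows why statistics on a view expression help: when the underlying columns are correlated, a bottom-up estimate under an independence assumption misses badly, while a histogram built directly on the view result does not.

```python
# Toy illustration of view statistics vs. bottom-up estimation with independence.
import random

random.seed(0)
rows = []
for _ in range(10_000):
    a = random.randint(0, 9)
    rows.append((a, a))                 # column b is perfectly correlated with a

# Bottom-up estimate of |a + b <= 5| assuming a and b are independent and uniform.
indep_pairs = sum(1 for x in range(10) for y in range(10) if x + y <= 5)
bottom_up = len(rows) * indep_pairs / 100

# "View" statistics: a value histogram on s = a + b, built from the actual view result.
hist = {}
for a, b in rows:
    hist[a + b] = hist.get(a + b, 0) + 1
view_based = sum(n for s, n in hist.items() if s <= 5)

true_count = sum(1 for a, b in rows if a + b <= 5)
print(bottom_up, view_based, true_count)   # ~2100 (off) vs ~3000 vs ~3000
```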


very large data bases | 2014

Advanced join strategies for large-scale distributed computation

Nicolas Bruno; YongChul Kwon; Ming-Chuan Wu

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.
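As a rough illustration of the kind of strategy choices involved (a toy sketch, not Scope's engine; all function names are invented), the snippet below contrasts a pair-wise partitioned join with a broadcast join that replicates a small input to every partition instead of repartitioning the large one:

```python
# Two simple distributed join strategies over hash-partitioned data.
from collections import defaultdict

def partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[hash(r[key]) % n].append(r)
    return parts

def local_hash_join(left, right, key):
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [(l, r) for l in left for r in index[l[key]]]

def partitioned_join(left, right, key, n=4):
    """Both inputs are repartitioned on the join key; partitions join pair-wise."""
    lp, rp = partition(left, key, n), partition(right, key, n)
    return [pair for i in range(n) for pair in local_hash_join(lp[i], rp[i], key)]

def broadcast_join(left, small_right, key, n=4):
    """The small side is replicated to every partition of the large side, avoiding a reshuffle."""
    lp = partition(left, key, n)
    return [pair for part in lp for pair in local_hash_join(part, small_right, key)]

orders = [{"cust": i % 3, "amount": i} for i in range(9)]
customers = [{"cust": 0, "name": "ann"}, {"cust": 1, "name": "bo"}]
assert len(partitioned_join(orders, customers, "cust")) == len(broadcast_join(orders, customers, "cust"))
```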


international conference on management of data | 2012

Recurring job optimization in scope

Nicolas Bruno; Sameer Agarwal; Srikanth Kandula; Bing Shi; Ming-Chuan Wu; Jingren Zhou

An increasing number of applications require distributed data storage and processing infrastructure over large clusters of commodity hardware for critical business decisions. The MapReduce programming model [2] helps programmers write distributed applications on large clusters, but requires dealing with complex implementation details (e.g., reasoning about data distribution and overall system configuration). Recent proposals, such as Scope [1], raise the level of abstraction by providing a declarative language that not only increases programming productivity but is also amenable to sophisticated optimization. As in traditional database systems, such optimization relies on detailed data statistics to choose the best execution plan in a cost-based fashion. However, in contrast to database systems, it is very difficult to obtain and maintain good-quality statistics in a highly distributed environment that contains tens of thousands of machines. First, it is very challenging to efficiently combine a large number of individually collected local statistics (e.g., histograms, distinct values) in a statistically meaningful way. Second, calculating statistics typically requires scans over the full dataset; such an operation can be overwhelmingly expensive for terabytes of data. Third, even if we can collect statistics for base tables, the nature of user scripts, which typically rely on user-defined code, makes the problem of statistical inference beyond selection and projection even more difficult during optimization. Finally, the cost of user-defined code is another important source of information for cost-based query optimization. Such information is crucial for the optimizer to choose the optimal degree of parallelism for the final execution plan and to decide when and where to execute the user code. It is challenging, if not impossible, to estimate its actual cost before running the query with the real dataset. We leverage the fact that a large proportion of scripts in this environment are parametric and recurring over a time series of data. The input datasets usually come in regularly,
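The underlying idea of feeding statistics from earlier runs of a recurring script back into optimization can be sketched as follows. This is a hypothetical sketch only; the data structures and names are assumptions, not the paper's design.

```python
# Hypothetical feedback loop for recurring scripts: remember cardinalities
# observed at each operator of a run, keyed by script signature, and prefer
# them over blind estimates when the same script recurs over new data.
observed_stats = {}   # (script_signature, operator_id) -> observed row count

def record_run(script_signature, operator_cardinalities):
    for op_id, rows in operator_cardinalities.items():
        observed_stats[(script_signature, op_id)] = rows

def estimate(script_signature, op_id, default_estimate):
    """Prefer statistics observed on a previous run of the same recurring script."""
    return observed_stats.get((script_signature, op_id), default_estimate)

# Day 1: the optimizer guesses; at runtime the job reports what it actually saw.
record_run("daily_clicks_v1", {"extract": 2_000_000, "filter_bots": 150_000})

# Day 2: the same parametric script recurs, and optimization starts from
# yesterday's observations instead of a default guess.
print(estimate("daily_clicks_v1", "filter_bots", default_estimate=1_000_000))   # 150000
```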


international conference on data engineering | 2005

Distributed/heterogeneous query processing in Microsoft SQL server

José A. Blakeley; Conor Cunningham; Nigel R. Ellis; Balaji Rathakrishnan; Ming-Chuan Wu

This paper presents an architecture overview of the distributed, heterogeneous query processor (DHQP) in the Microsoft SQL Server database system, which enables queries over a large collection of diverse data sources. The paper highlights three salient aspects of the architecture. First, the system introduces well-defined abstractions such as connections, commands, and rowsets that enable sources to plug into the system. These abstractions are formalized by the OLE DB data access interfaces; the generality of OLE DB and its broad industry adoption enable our system to reach a very large collection of diverse data sources, ranging from personal productivity tools to database management systems to file system data. Second, the DHQP is built into the relational optimizer and execution engine of the system, which enables DH queries and updates to benefit from the cost-based algebraic transformations and execution strategies available in the system. Finally, the architecture is inherently extensible to support new data sources as they emerge, and it serves as a key extensibility point for the relational engine to add new features such as full-text search and distributed partitioned views.


International Journal of Cooperative Information Systems | 1996

A HYPERRELATIONAL APPROACH TO INTEGRATION AND MANIPULATION OF DATA IN MULTIDATABASE SYSTEMS

Chiang Lee; Ming-Chuan Wu

The issue of interoperability among multiple autonomous databases has attracted a lot of attention from researchers in recent years. Past research on this issue can be roughly divided into two main categories: the tightly-integrated approach, which integrates databases by building an integrated schema, and the loosely-integrated approach, which achieves interoperability through a multidatabase language. Most past efforts have focused on the first approach. The problem with the first approach, however, is that it lacks a convenient representation of the integrated schema at the system level and a sound mathematical basis for data manipulation in a multidatabase system. In this paper, we propose hyperrelations as a powerful and succinct model for the global-level representation of heterogeneous database schemas. A hyperrelation has the structure of a relation, but its contents are the schemas of the semantically equivalent local relations in the databases. With this representation, the metadata of the global database and the local databases, as well as the data of these databases, are all representable using the structure of a relation. The impact of such a representation is that all the elegant features of relational systems can be easily extended to multidatabase systems. A hyperrelational algebra is designed accordingly. This algebra operates at the multidatabase system (MDBS) level, so that query transformation and optimization are supported on a sound mathematical basis. The major contributions of this paper are: (1) local relations with various schemas (even if they carry information with the same semantics) can be systematically mapped to hyperrelations; since the structure of a hyperrelation is similar to that of a relation, data manipulation and management tasks (such as the design of the global query language and the view mechanism) are greatly facilitated; and (2) the hyperrelational algebra provides a sound basis for query transformation and optimization in an MDBS.
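A minimal sketch of the representation idea (an assumed encoding, not the paper's formalism): a hyperrelation's attributes are global attribute names, and each of its tuples records which local relation and columns supply those attributes in one participating database.

```python
# Hypothetical hyperrelation for a global "Employee" concept spanning two databases.
hyper_employee = {
    "attributes": ["name", "salary", "dept"],
    "tuples": [
        # one tuple per semantically equivalent local relation
        {"db": "db1", "relation": "EMP",   "name": "ename",    "salary": "sal", "dept": "deptno"},
        {"db": "db2", "relation": "STAFF", "name": "fullname", "salary": "pay", "dept": "unit"},
    ],
}

def rewrite_projection(hyper, wanted):
    """Translate a global projection into one projection per local relation."""
    queries = []
    for t in hyper["tuples"]:
        cols = ", ".join(t[a] for a in wanted)
        queries.append(f'SELECT {cols} FROM {t["db"]}.{t["relation"]}')
    return queries

for q in rewrite_projection(hyper_employee, ["name", "salary"]):
    print(q)
# SELECT ename, sal FROM db1.EMP
# SELECT fullname, pay FROM db2.STAFF
```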


international workshop on testing database systems | 2012

Scope playback: self-validation in the cloud

Ming-Chuan Wu; Jingren Zhou; Nicolas Bruno; Yu Zhang; Jon Fowler

The last decade witnessed the emergence of various distributed storage and computation systems for cloud-scale data processing. Scope is the distributed computation platform targeted at a variety of data analysis and data mining applications, powering Bing and other online services at Microsoft. Scope combines benefits of both traditional parallel databases and MapReduce execution engines to allow easy programmability. It features a SQL-like declarative scripting language with .NET extensions, and delivers massive scalability and high performance through advanced optimization. Scope currently operates over tens of thousands of machines and processes over a million jobs per month. Such a massive data computation platform presents new challenges and opportunities for efficient and effective testing and validation. Traditional approaches for testing database systems are not always sufficient due to several factors. Model-based query generation typically fails to cover user-defined code, which is very common in Scope scripts. Additionally, rapid release cycles in the platform-as-a-service environment require tools that quickly identify potential regressions, predict the impact of breaking changes, and provide massive test coverage in a short amount of time. In this paper, we describe a test automation tool, called Scope Playback, that addresses these new requirements. Scope Playback leverages the Scope system itself in two important ways. First, it exploits data about every job submitted to production clusters, which is automatically stored by the Scope system. Second, the testing process itself is implemented as a Scope script, automatically benefiting from transparent and massive computation parallelism. Scope Playback currently serves as a crucial validation technique and ensures product quality during Scope release cycles.
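The replay idea can be summarized with a short sketch. The helper names below are hypothetical, not Scope Playback's actual interfaces: archived production scripts are recompiled with a candidate build and any job whose outcome differs from the shipped baseline is flagged.

```python
# Rough sketch of playback-style self-validation over archived production jobs.
def playback(archived_jobs, compile_baseline, compile_candidate):
    regressions = []
    for job in archived_jobs:                        # one record per submitted job
        try:
            new_plan = compile_candidate(job["script"])
        except Exception as err:
            regressions.append((job["id"], f"compilation failed: {err}"))
            continue
        old_plan = compile_baseline(job["script"])
        if new_plan != old_plan:
            regressions.append((job["id"], "plan changed"))
    return regressions

# In the real system this loop would itself be expressed as a Scope script, so the
# replay is spread across the cluster instead of running on a single machine.
jobs = [{"id": 1, "script": "SELECT a FROM t"}, {"id": 2, "script": "SELECT b FROM t"}]
print(playback(jobs, compile_baseline=str.upper, compile_candidate=str.upper))   # []
```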

Collaboration


Dive into Ming-Chuan Wu's collaborations.

Top Co-Authors

Sameer Agarwal
University of California

Alejandro P. Buchmann
Technische Universität Darmstadt

Chiang Lee
National Cheng Kung University