Dongyao Wu
NICTA
Publication
Featured research published by Dongyao Wu.
IEEE Software | 2016
Dongyao Wu; Liming Zhu; Xiwei Xu; Sherif Sakr; Daniel Sun; Qinghua Lu
Many real-world data analysis scenarios require pipelining and integration of multiple (big) data-processing and data-analytics jobs, which often execute in heterogeneous environments, such as MapReduce; Spark; or R, Python, or Bash scripts. Such a pipeline requires much glue code to get data across environments. Maintaining and evolving these pipelines is difficult. Pipeline frameworks that try to solve such problems are usually built on top of a single environment and might require rewriting the original job against a new API or paradigm. The Pipeline61 framework supports the building of data pipelines involving heterogeneous execution environments. It reuses the existing code of the deployed jobs in different environments and provides version control and dependency management that deals with typical software engineering issues. A real-world case study shows its effectiveness. This article is part of a special issue on Software Engineering for Big Data Systems.
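The pipeline idea can be pictured as composable stages that wrap already-deployed jobs in their native environments and hand data across via file paths. The sketch below is illustrative only and does not reproduce Pipeline61's actual API; the `Stage`, `ShellStage`, and `Pipeline` names, the script names, and the paths are hypothetical.

```scala
import scala.sys.process._

// A hypothetical stage abstraction: each stage wraps an existing job
// in its native environment and exchanges data via file paths.
trait Stage {
  def run(input: String): String // returns the output path
}

// Wraps a shell/Python/R script that reads `input` and writes `output`.
final case class ShellStage(script: String, output: String) extends Stage {
  def run(input: String): String = {
    val exit = Seq("bash", script, input, output).! // scala.sys.process
    require(exit == 0, s"$script failed with exit code $exit")
    output
  }
}

// A pipeline is a left-to-right composition of stages.
final case class Pipeline(stages: List[Stage]) {
  def run(input: String): String = stages.foldLeft(input)((path, s) => s.run(path))
}

object PipelineDemo extends App {
  // e.g. a Bash pre-processing step followed by a modelling step (wrapping Rscript)
  val pipeline = Pipeline(List(
    ShellStage("clean.sh", "/tmp/cleaned.csv"),
    ShellStage("train_model.sh", "/tmp/model.out")
  ))
  println(pipeline.run("/data/raw.csv"))
}
```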
Quality of Software Architectures | 2015
Donna Xu; Dongyao Wu; Xiwei Xu; Liming Zhu; Len Bass
Conducting (big) data analytics in an organization is not just about using a processing framework (e.g. Hadoop/Spark) to learn a model from data currently in a single file system (e.g. HDFS). We frequently need to pipeline real-time data from other systems into the processing framework and continually update the learned model. The processing frameworks need to be easily invokable for different purposes to produce different models. The model and the subsequent model updates need to be integrated with a product that may require real-time prediction using the latest trained model. All of these need to be shared among different teams in the organization for different data analytics purposes. In this paper, we propose a real-time data-analytics-as-a-service architecture that uses RESTful web services to wrap and integrate data services, dynamic model-training services (supported by a big data processing framework), prediction services and the product that uses the models. We discuss the challenges in wrapping big data processing frameworks as services and other architecturally significant factors that affect system reliability, real-time performance and prediction accuracy. We evaluate our architecture using a log-driven system operation anomaly detection system where the staleness of the data used in model training and the speed of model update and prediction are critical requirements.
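A minimal sketch of the service decomposition described above is given below. The trait and class names are hypothetical, the "model" is a trivial mean of the training data, and the RESTful wrapping is omitted; the point is only that a registry holds the latest trained model so the prediction side always serves the most recent version.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical service interfaces; in the proposed architecture these would
// be exposed as RESTful endpoints rather than in-process calls.
final case class Model(version: Long, weights: Vector[Double])

trait TrainingService {
  def train(data: Seq[Vector[Double]]): Model
}

trait PredictionService {
  def predict(features: Vector[Double]): Double
}

// Holds the latest trained model so predictions never block on retraining.
final class ModelRegistry {
  private val current = new AtomicReference(Model(0L, Vector.empty[Double]))
  def publish(m: Model): Unit = current.set(m)
  def latest: Model = current.get()
}

// Toy trainer: the "model" is just the per-feature mean of the training data.
final class MeanTrainer extends TrainingService {
  def train(data: Seq[Vector[Double]]): Model = {
    val dim = data.head.length
    val mean = Vector.tabulate(dim)(i => data.map(_(i)).sum / data.size)
    Model(System.currentTimeMillis(), mean)
  }
}

final class DotProductPredictor(registry: ModelRegistry) extends PredictionService {
  def predict(features: Vector[Double]): Double =
    features.zip(registry.latest.weights).map { case (x, w) => x * w }.sum
}

object ServiceDemo extends App {
  val registry = new ModelRegistry
  registry.publish(new MeanTrainer().train(Seq(Vector(1.0, 2.0), Vector(3.0, 4.0))))
  println(new DotProductPredictor(registry).predict(Vector(0.5, 0.5))) // 2.5
}
```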
Handbook of Big Data Technologies | 2017
Dongyao Wu; Sherif Sakr; Liming Zhu
Big data programming models represent the style of programming and define the interface paradigms for developers to write big data applications and programs. Programming models are normally the core feature of big data frameworks, as they implicitly affect the execution model of big data processing engines and also drive the way users express and construct big data applications and programs. In this chapter, we comprehensively investigate different programming models for big data frameworks, with comparisons and concrete code examples.
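As a flavour of what such a comparison looks like, a functional data-flow model expresses word count as a chain of transformations. The plain-Scala sketch below mirrors the style used by engines such as Spark without depending on them; the input lines are made up for illustration.

```scala
object WordCount extends App {
  val lines = Seq("big data programming models", "data flow programming")

  // The same map/flatMap/reduce style used by functional data-flow engines,
  // expressed here over in-memory Scala collections for illustration.
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))          // tokenise each line
      .map(word => word -> 1)            // emit (word, 1) pairs
      .groupMapReduce(_._1)(_._2)(_ + _) // group by word and sum the counts

  counts.foreach { case (w, c) => println(s"$w\t$c") }
}
```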
International Conference on Big Data | 2015
Dongyao Wu; Sherif Sakr; Liming Zhu; Qinghua Lu
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are coarsely defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. It also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the Hierarchically Distributed Data Matrix (HDM), a functional, strongly typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that our optimizations can achieve improvements of 10% to 60% in job completion time for different types of operation sequences when compared with the current state of the art, Apache Spark.
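The idea of a functional, strongly typed and composable data representation can be illustrated with a toy lazily evaluated collection type: transformations only record the dependency graph, and evaluation happens when the result is requested. This is a sketch of the concept only, not HDM's actual API; all names below are hypothetical.

```scala
// A toy, strongly typed, lazily evaluated data representation.
// Transformations only record the dependency; compute() materialises it.
sealed trait DataRep[T] {
  def map[U](f: T => U): DataRep[U]       = Mapped(this, f)
  def filter(p: T => Boolean): DataRep[T] = Filtered(this, p)
  def compute(): Seq[T]
}

final case class Source[T](data: Seq[T]) extends DataRep[T] {
  def compute(): Seq[T] = data
}
final case class Mapped[T, U](parent: DataRep[T], f: T => U) extends DataRep[U] {
  def compute(): Seq[U] = parent.compute().map(f)
}
final case class Filtered[T](parent: DataRep[T], p: T => Boolean) extends DataRep[T] {
  def compute(): Seq[T] = parent.compute().filter(p)
}

object DataRepDemo extends App {
  val job = Source(1 to 10).map(_ * 2).filter(_ % 3 == 0) // composable, reusable
  println(job.compute())                                  // Vector(6, 12, 18)
}
```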
Handbook of Big Data Technologies | 2017
Dongyao Wu; Sherif Sakr; Liming Zhu
Data and storage models are the basis of big data ecosystem stacks. While the storage model captures the physical aspects and features of data storage, the data model captures the logical representation and structures used for data processing and management. Understanding the storage and data models together is essential for understanding the big data ecosystems built on top of them. In this chapter, we investigate and compare the key storage and data models across the spectrum of big data frameworks.
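One distinction covered by such a comparison, between the logical data model and the physical layout, can be seen in how the same records may be stored row-wise or column-wise. The sketch below is a simplified illustration with made-up data, not code from the chapter.

```scala
object LayoutsDemo extends App {
  final case class User(id: Int, country: String, age: Int) // logical data model

  val users = Seq(User(1, "AU", 34), User(2, "CN", 28), User(3, "AU", 41))

  // Row-oriented layout: one record per entry, good for whole-record access.
  val rowStore: Seq[User] = users

  // Column-oriented layout: one sequence per attribute, good for analytics
  // that scan a single column (e.g. the average age).
  val columnStore: Map[String, Seq[Any]] = Map(
    "id"      -> users.map(_.id),
    "country" -> users.map(_.country),
    "age"     -> users.map(_.age)
  )

  val avgAge = columnStore("age").collect { case a: Int => a }.sum.toDouble / users.size
  println(s"average age = $avgAge")
}
```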
Trust, Security and Privacy in Computing and Communications | 2016
Huijun Wu; Liming Zhu; Kai Lu; Gen Li; Dongyao Wu
Parallel file systems are important infrastructure for both cloud and high-performance computing. The performance of metadata operations is critical to achieving high scalability in parallel file systems. Nevertheless, traditional parallel file systems lack scalable metadata services. To alleviate these problems, some previous research distributes metadata to separate large-scale clusters and uses write-optimized techniques such as the log-structured merge tree (LSM-tree) to store metadata. However, the LSM-tree design does not consider the features of solid-state drives (SSDs), which are widely deployed in modern parallel computing systems, so using LSM-trees to store metadata has not exploited the potential benefits of SSD devices. In this paper, we present StageFS, a parallel file system optimized for SSD-based clusters. StageFS stores both the metadata and small files in LSM-trees for fast indexing. For larger files, the file blocks are stored separately to reduce write amplification. In addition, the parallel I/O feature of SSD devices is used to improve the performance of accessing directories and large files. To avoid frequent small writes, StageFS uses buffering to better utilize the bandwidth of SSD devices. Experimental results show that StageFS provides better performance for metadata operations (up to 21.28x) and small-file access (1.92x to two orders of magnitude) compared with Ceph and HDFS.
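The LSM-tree idea behind this kind of metadata store is to buffer writes in a sorted in-memory table and flush it to immutable sorted runs, checking the freshest data first on reads. The toy store below is an illustration of that idea under simplified assumptions (no compaction, no persistence), not StageFS code; the key/value names are hypothetical.

```scala
import scala.collection.immutable.TreeMap

// Toy log-structured merge store: writes go to a sorted in-memory memtable;
// when it grows past a threshold it is flushed to an immutable sorted run.
// Reads check the memtable first, then the runs from newest to oldest.
final class ToyLsmStore(flushThreshold: Int = 4) {
  private var memtable = TreeMap.empty[String, String]
  private var runs: List[TreeMap[String, String]] = Nil

  def put(key: String, value: String): Unit = {
    memtable += (key -> value)
    if (memtable.size >= flushThreshold) {
      runs = memtable :: runs // newest run first
      memtable = TreeMap.empty
    }
  }

  def get(key: String): Option[String] =
    memtable.get(key).orElse(runs.view.flatMap(_.get(key)).headOption)
}

object LsmDemo extends App {
  val store = new ToyLsmStore()
  (1 to 6).foreach(i => store.put(s"/dir/file$i", s"inode-$i")) // metadata entries
  println(store.get("/dir/file2")) // Some(inode-2)
}
```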
IEEE Software | 2016
Dongyao Wu; Liming Zhu; Xiwei Xu; Sherif Sakr; Qinghua Lu; Daniel Sun
Many real-world data analysis scenarios require pipelining and integration of multiple (big) data wrangling and analytics jobs, which are often executed in heterogeneous environments such as MapReduce, Spark, and R/Python/Bash scripts. For such a pipeline, a large amount of glue code has to be written to get data across environments. Maintaining and evolving such pipelines is difficult. Existing pipeline frameworks that try to solve such problems are usually built on top of a single environment, and/or require the original job to be rewritten against a new API or paradigm. In this article, we propose Pipeline61, a framework that supports the building of data pipelines involving heterogeneous execution environments. Pipeline61 reuses the existing code of the deployed jobs in different environments and also provides version control and dependency management that deals with typical software engineering issues. A real-world case study is used to show the effectiveness of Pipeline61 over the state of the art.
IEEE Transactions on Big Data | 2018
Dongyao Wu; Liming Zhu; Qinghua Lu; Sherif Sakr
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are coarsely defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. It also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the Hierarchically Distributed Data Matrix (HDM), a functional, strongly typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution, integration and management of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that our optimizations can achieve improvements of between 10 and 40 percent in job completion time for different types of applications when compared with the current state of the art, Apache Spark.
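One of the optimizations made possible by a functional dependency graph is fusing consecutive element-wise operators into a single pass over the data. The sketch below is a simplified illustration of that general technique, not the actual HDM optimizer; the operator names and the linear plan representation are assumptions.

```scala
// A simplified view of a linear operator chain and a fusion pass that
// collapses consecutive map operators into one, so the data is traversed once.
sealed trait Op
final case class MapOp(f: Int => Int)        extends Op
final case class FilterOp(p: Int => Boolean) extends Op

object FusionDemo extends App {
  def fuse(ops: List[Op]): List[Op] = ops match {
    case MapOp(f) :: MapOp(g) :: rest => fuse(MapOp(f.andThen(g)) :: rest)
    case op :: rest                   => op :: fuse(rest)
    case Nil                          => Nil
  }

  def run(data: Seq[Int], ops: List[Op]): Seq[Int] =
    ops.foldLeft(data) {
      case (d, MapOp(f))    => d.map(f)
      case (d, FilterOp(p)) => d.filter(p)
    }

  val plan  = List(MapOp(_ + 1), MapOp(_ * 2), FilterOp(_ > 4))
  val fused = fuse(plan)      // the two maps are collapsed into one
  println(run(1 to 5, fused)) // Vector(6, 8, 10, 12)
}
```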
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017
Dongyao Wu; Sherif Sakr; Liming Zhu; Huijun Wu
Big data is increasingly collected and stored in highly distributed infrastructures due to the development of sensor networks, cloud computing, IoT and mobile computing, among many other emerging technologies. In practice, the majority of existing big-data-processing frameworks (e.g., Hadoop and Spark) are designed around a single-cluster setup, with assumptions of centralized management and homogeneous connectivity that make them sub-optimal, and sometimes infeasible, for scenarios that require running data analytics jobs on highly distributed data sets (across racks, data centers or organizations). To tackle this challenge, we present HDM-MC, a multi-cluster big data processing framework designed to enable large-scale data analytics across multiple clusters with minimal extra overhead from the additional scheduling requirements. In this paper, we present the architecture and realization of the system. In addition, we evaluate the performance of our framework in comparison to state-of-the-art single-cluster big data processing frameworks.
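A core scheduling concern in the multi-cluster setting is placing a job where most of its input already resides, so that cross-cluster data movement stays small. The toy placement rule below illustrates that idea only; it is not HDM-MC's scheduler, and the partition and cluster names are made up.

```scala
object LocalityAwarePlacement extends App {
  // Which cluster hosts each input partition of a job, and how large it is.
  final case class Partition(id: Int, cluster: String, sizeMb: Long)

  // Pick the cluster that already holds the largest share of the input,
  // so that cross-cluster data movement is minimised.
  def chooseCluster(parts: Seq[Partition]): String =
    parts.groupBy(_.cluster)
         .view.mapValues(_.map(_.sizeMb).sum)
         .maxBy(_._2)._1

  val input = Seq(
    Partition(0, "dc-sydney",   512),
    Partition(1, "dc-sydney",   256),
    Partition(2, "dc-shanghai", 128)
  )
  println(chooseCluster(input)) // dc-sydney
}
```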
International Conference on Cloud Computing | 2015
Qinghua Lu; Liming Zhu; He Zhang; Dongyao Wu; Zheng Li; Xiwei Xu
MapReduce has become the standard model for supporting big data analytics. In particular, MapReduce job optimization has been widely considered crucial in implementations of big data analytics. However, there is still a lack of guidelines, especially for practitioners, on how MapReduce jobs can be optimized. This paper aims to systematically identify and taxonomically classify the existing work on job optimization. We conducted a mapping study of 47 selected papers published between 2004 and 2014. We classified and compared the selected papers based on a 5WH-based characterization framework. This study generates a knowledge base of current job optimization solutions and also identifies a set of research gaps and opportunities. It concludes that job optimization is still at an early stage of maturity, and that more attention needs to be paid to cross-data-center, cross-cluster and cross-rack job optimization to improve communication efficiency.
Collaboration
Dive into Dongyao Wu's collaborations.
Commonwealth Scientific and Industrial Research Organisation